Abstract


Diverse  algorithms  have  been  proposed  and  implemented  which 
allow  machine  recognition  of  the  similarity  or  equivalence 
of  two  different  character  string  representations  of  a 
single  word.  In  CAI  (Computer  Assisted  Instruction)  these 
can  be  used  to  recognize  student  responses  as  alternate  or 
incorrect  spellings  of  a  target  word  specified  by  a  course 
author.  Experiments  were  performed  in  order  to  compare  the 
accuracy  of  some  available  approximate  string  matching 
functions  and  to  develop  an  optimized  function  suitable  for 
response  analysis  in  CAI.  A  version  of  the  edit  distance 
algorithm,  with  edit  costs  for  characters  dependent  on  the 
probability  of  corruption  of  the  character,  was  found  to  be 
superior  for  the  sample  of  data  used.  The  use  of  approximate 
string  matching  with  response  markup  and  dictionary  features 
is  discussed. 
A rationale for writing a dissertation (besides the
obvious one) should have something to do with the
desire to shorten someone else's path to a given
point so that he may press on to more interesting
horizons.
I. The Approximate String Matching Problem

In  computer  assisted  instruction  (CAI),  the  analysis  of 
student  responses  requires  a  test  which  determines  if  a 
response  belongs  to  a  defined  set.  Consider  the  case  where 
the  student  response  is  a  string  of  characters  representing 
an  English  word.  The  problem  facing  a  course  author  using  a 
conventional  programming  language  is  how  to  define,  without 
resorting  to  the  unpleasant  method  of  enumeration,  the  set 
of  character  strings  which  he  will  accept  as  representing 
the  word  he  is  expecting.  The  set  may  include  alternate 
spellings,  incorrect  spellings,  differences  in  letter  case, 
differences  in  tense,  and  possibly  synonyms;  all  of  which 
are  equivalent  for  the  purposes  of  the  course  author. 

Specialized  CAI  authoring  languages  usually  allow  an 
author  to  define  sets  of  strings  by  using  some  convenient 
and economical notation. For example, in Coursewriter II, an
early CAI language (IBM, 1968), authors could use the
notation sp*l* to indicate a large set of acceptable strings
such as spell, spill, spwlx, spqlrst, and spelling but
excluding strings like abode, spaall, and spel. The
Coursewriter interpreter compared every ith character in the
author's notation with the ith character in the student
response. All characters but asterisks were required to
match exactly. An asterisk in the final position could match
any substring, but any other asterisk could match only a
single character.
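The matching rule just described might be sketched in Python as follows. This is an illustration reconstructed from the description above, not the Coursewriter II implementation. The function name coursewriter_match is hypothetical, and the requirement that a final asterisk consume at least one character is an assumption made so that spel is rejected, as in the example.

```python
def coursewriter_match(pattern: str, response: str) -> bool:
    """Position-by-position match in the style described for sp*l*.

    Assumption: a final asterisk must match a non-empty substring,
    which is what the rejection of 'spel' in the text suggests.
    """
    for i, p in enumerate(pattern):
        is_last = i == len(pattern) - 1
        if p == "*" and is_last:
            # Final asterisk: accept any non-empty remainder.
            return len(response[i:]) > 0
        if i >= len(response):
            return False  # response ran out before the pattern did
        if p != "*" and p != response[i]:
            return False  # literal characters must match exactly
        # A non-final asterisk matches exactly one character of any kind.
    # No final asterisk: lengths must agree.
    return len(response) == len(pattern)


accepted = [w for w in ["spell", "spill", "spwlx", "spqlrst", "spelling"]
            if coursewriter_match("sp*l*", w)]
rejected = [w for w in ["abode", "spaall", "spel"]
            if not coursewriter_match("sp*l*", w)]
```

Under these assumptions every string in the first list is accepted and every string in the second is rejected, which illustrates how coarse the notation is: spwlx is accepted while the plausible misspelling spel is not.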
Mentor, an early system supporting instructional
dialogues (Feurzeig, 1969), recognized certain student
responses as misspellings of a word given by an author.
Feurzeig provides an excerpt from a dialogue teaching
medical diagnosis where Mentor accepts 'cuogh' as an
alternative to 'cough'.

The  success  of  any  approximate  string  matching 
algorithm  depends  on  how  closely  the  set  of  responses 
acceptable  to  the  algorithm  corresponds  to  the  set 
acceptable  to  the  author.  A  more  formal  statement  of  this 
criterion  requires  the  definition  of  a  few  terms.  Consider  a 
function f which accepts two arguments: t, a "target" string
supplied by the author, and s, a string parsed from, or by
itself constituting, a student's entered response. The function f
returns a one if the two strings match or a zero otherwise. Over the
lifetime of f in some CAI environment, t will take on n
values or attributes which may be together represented as
the vector $T = (t_1, t_2, \ldots, t_j, \ldots, t_n)$. For each
member $t_j$ of T there will be $m_j$ occurrences of s corresponding
to the $m_j$ occasions that f is invoked with the argument $t_j$.
So, the values s takes on may be represented by the matrix S
as in Figure 1. The total number of times f is invoked may
be expressed as $\sum_{j=1}^{n} m_j$. The frequency that a one is returned
will be:

\[ \sum_{j=1}^{n} \sum_{i=1}^{m_j} f(s_{ij}, t_j) \]
[Figure 1: Author and Student Strings. Labels: author strings; student strings.]

The frequency that a zero is returned will be:

\[ \sum_{j=1}^{n} m_j - \sum_{j=1}^{n} \sum_{i=1}^{m_j} f(s_{ij}, t_j) \]


If an author, given $t_j$ and $s_{ij}$, were able to personally
perform all $\sum_{j=1}^{n} m_j$ judgements, one might say that his
behavior defined a function called f'.1 The frequency of
agreement and disagreement between f and f' is presented in
Figure 2 as a 2x2 table. When f returns a one and f' returns
a zero, f is making a type I error, the frequency of which is
given in the lower left quadrant. When f returns a zero and
f' returns a one, f is making a type II error, the frequency
of which is given in the upper right quadrant.


1 As in Thorelli's (1962) description of "man as a secondary
machine" in relation to the problem of error correction in
text.
Figure 2: Error Frequencies
One might also present the problem by means of a Venn
diagram (Figure 3) where the outer circle S represents the
set of all student responses. For every element $t_j$ of T, f
defines a subset of S called $S_j$ consisting of those
responses which cause f to return a one. Similarly, for
each element of T, f' will define a subset $S'_j$. Therefore,
$S_j - (S_j \cap S'_j)$ represents the student strings resulting in type
I error and $S'_j - (S_j \cap S'_j)$ represents those producing a type II
error for the jth member of T.
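The two error sets can be computed directly from their definitions. The following Python sketch is illustrative only: the response list, the author's accepted set, and the two candidate matching functions (exact and same_first_two) are hypothetical stand-ins for f and f', not functions evaluated in this study.

```python
def error_sets(responses, target, f, f_prime):
    """Return (type_I, type_II) response sets for one target string."""
    s_j = {s for s in responses if f(s, target)}              # accepted by f
    s_prime_j = {s for s in responses if f_prime(s, target)}  # accepted by the author
    agreement = s_j & s_prime_j
    # Type I: accepted by f but not by the author.
    # Type II: accepted by the author but not by f.
    return s_j - agreement, s_prime_j - agreement


# Hypothetical data for a single target, "cough".
responses = ["cough", "cuogh", "couch", "cold"]
author_accepts = {"cough", "cuogh"}


def f_prime(s, t):
    # Stand-in for the author's judgement on this one target.
    return s in author_accepts


def exact(s, t):
    return s == t


def same_first_two(s, t):
    return s[:2] == t[:2]


strict_type_1, strict_type_2 = error_sets(responses, "cough", exact, f_prime)
lenient_type_1, lenient_type_2 = error_sets(responses, "cough", same_first_two, f_prime)
```

With these made-up inputs the exact match commits no type I errors but misses cuogh, while the lenient function wrongly accepts couch and cold and still misses cuogh. The trade-off between the two kinds of error is the subject of the next paragraph.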

The  problem  of  designing  an  optimal  approximate  string 
matching  function  for  a  particular  CAI  application  is  the 
problem  of  maximizing  the  agreement  between  that  function, 
f,  and  the  author's  judgement,  f',  while  minimizing  some 
appropriate  balance  of  type  I  and  type  II  errors.  The 
purpose  of  this  study  was  to  determine  which  algorithms  most 
effectively resolve this problem and are thus most deserving
of  inclusion  in  CAI  languages  and  response  analysis  systems. 

Chapter I is a review of some approximate string
matching  algorithms  found  in  CAI  and  other  applications. 
Chapter  II  reports  on  experiments  comparing  some  of  the 
algorithms  reviewed,  and  evaluations  of  modifications 
introduced  by  the  present  author.  In  Chapter  III, 
consideration  is  given  to  applying  the  findings  of  Chapter 
II  to  the  design  of  CAI  languages  and  response  analysis 
systems.  Related  problems  such  as  the  determination  of 
equivalence  of  algebraic  statements  or  of  English  sentences 
are  not  within  the  scope  of  this  thesis. 

In  their  review  of  string  matching  techniques  applied 
to  information  retrieval  from  data  bases,  Hall  and  Dowling 
(1980)  established  a  number  of  definitions  and  conventions 
which are followed here wherever they are relevant to
response analysis. Of notable importance is their
distinction  between  string  similarity  and  string 
equivalence.  These  and  other  definitions  are  cast  into  a 
form  applicable  to  the  general  problem  outlined  above. 

Given strings t and s, a string similarity function
returns a value $r_{st}$ which represents the proximity of t and
s. The definition of proximity is determined by the
particular similarity function at hand. Similarity functions
are characterized by reflexivity ($r_{ss} = r_{tt}$) and symmetry
($r_{st} = r_{ts}$), but not transitivity. It is frequently useful to
force a binary result on a similarity function by setting a
threshold R on $r_{st}$ to produce a new similarity relation
$r'_{st}$. In this case $r'_{st}$ indicates whether s is a member of
$S_j$, which includes and is specified by $t_j$.
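A small Python sketch may make the thresholding step concrete. The similarity measure used here, one minus the edit distance divided by the longer string length, is an assumption chosen for illustration and is not taken from Hall and Dowling. The names similarity and matches and the threshold value 0.6 are likewise hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost insertion/deletion/substitution distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(s: str, t: str) -> float:
    """Proximity in [0, 1]; symmetric, and 1.0 for identical strings."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s, t) / longest


def matches(s: str, t: str, threshold: float) -> bool:
    """Binary relation obtained by thresholding the similarity value."""
    return similarity(s, t) >= threshold


R = 0.6  # hypothetical threshold
cat_cot = matches("cat", "cot", R)
cot_dot = matches("cot", "dot", R)
cat_dot = matches("cat", "dot", R)
```

Under these assumptions cat matches cot and cot matches dot, but cat does not match dot, so the thresholded relation is symmetric without being transitive.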

Equivalence  functions  are  a  subclass  of  similarity 

functions  which  return  a  binary  result  and  are  transitive 

(where x is a third string, if $r_{st} = r_{tx} = 1$ then $r_{sx} = 1$).

According  to  Hall  &  Dowling,  this  implies  that: 

The  equivalence  relation  divides  the  set  S  of  all 
strings into subsets $S_1$, $S_2$, $S_3$, ... such that all
strings  in  a  subset  are  equivalent  to  each  other  and 
not  equivalent  to  any  string  in  any  other  subset. 

These  subsets  are  called  "equivalence  classes". 

Therefore, when f is an equivalence function, $S_j$ is the
equivalence class determined by $t_j$.

There  exists  a  superclass  of  similarity  functions, 
approximate  string  matching  functions,  whose  members  are 
constrained  by  neither  transitivity  nor  symmetry.  The 
matching facility in Coursewriter illustrated above falls
only  into  this  common  class. 

String  Equivalence  Functions 

The  Soundex  method  (Table  1),  hereafter  referred  to  as 
SOUNDEX,  qualifies  as  the  earliest  and  most  widely  known 
string  equivalence  function.  The  algorithm  given  in  Table  1 
is  apparently  the  product  of  modifications  made  at  the 
Remington-Rand  corporation  to  a  system  for  filing  documents 
originally  patented  in  1918  by  R.C.  Russell.  In  Russell’s 
system2  a  code  for  a  name  was  created  by  subjecting  all 
characters  except  that  in  the  initial  position  to  a  series 
of simple transformations. All instances of h, w, final s,
final z, or the digraph gh were deleted. The remaining
characters  were  replaced  by  a  digit  which  grouped  together 
letters  representing  similar  sounds  as  in  Table  1.  Finally, 
identical  adjacent  digits  and  all  but  the  initial  instance 
of  the  digit  representing  vowels  were  deleted.  Later 
modifications  introduced  truncation  to  four  characters  and 
deletion  of  all  instances  of  the  vowel  digit,  but  abandoned 
the  initial  deletion  of  final  s,  final  z,  and  gh. 

Hall  and  Dowling  refer  to  the  SOUNDEX  code,  and 

comparable  codes  in  other  equivalence  functions,  as 

canonical  forms  which  define  an  equivalence  class.  The 

canonical  form  may  be  thought  of  as  the  minimally 

informative string from which all members of the equivalence

2  Two  patents  were  registered  with  the  U.S.  patent  office, 
number  1261167  in  1918  and  number  1435663  in  1922. 
Table 1: The SOUNDEX Algorithm


Do the following operations on t and s independently
to generate their canonical forms:

1. Set the first character in the canonical form
   to be the first letter in the original string.

2. Use this chart to replace the letters in the
   original string with the corresponding digits.

      0   a, e, i, o, u, h, w, y
      1   b, f, p, v
      2   c, g, j, k, q, s, x, z
      3   d, t
      4   l
      5   m, n
      6   r

3. Delete all zeros.

4. Delete all repeated adjacent digits.

5. Truncate to four characters.

If t and s have the same canonical form then they
are equivalent; otherwise they are not equivalent.


Examples:               Counterexamples:

machine  => M25         bridge   => B632
masheen  => M25         brige    => B62

filament => F453        decision => D25
fixture  => F236        disown   => D25

class  it  represents  can  be  generated.  Although  any  member  of 
the  equivalence  class  can  generate  all  the  other  members, 
all  except  the  canonical  form  have  information  which  is 
redundant to that task. The loss of information in the
derivation  of  the  canonical  form  is  manifested  as  a 
reduction  in  string  length,  a  reduction  in  alphabet  size,  or 
both.  In  the  case  of  SOUNDEX,  there  is  shrinkage  in  both  the 
alphabet  size  (26  letters  to  6  digits  in  all  but  the  first 
character)  and  in  the  length  (resulting  from  truncation  and 
deletion  of  characters). 
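The derivation of the SOUNDEX canonical form in Table 1 can be sketched in Python as follows. This is a minimal reading of the table, not a reference implementation. The function name soundex is hypothetical, and two details are assumptions: non-alphabetic characters are ignored, and the retained first letter does not take part in the deletion of repeated adjacent digits.

```python
SOUNDEX_CHART = {
    "0": "aeiouhwy",
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}
DIGIT_FOR = {letter: digit
             for digit, letters in SOUNDEX_CHART.items()
             for letter in letters}


def soundex(word: str) -> str:
    """Canonical form following the five steps of Table 1."""
    letters = [c for c in word.lower() if c in DIGIT_FOR]  # assumption: skip non-letters
    if not letters:
        return ""
    digits = [DIGIT_FOR[c] for c in letters[1:]]   # step 2 (first letter kept as is)
    digits = [d for d in digits if d != "0"]       # step 3: delete zeros
    collapsed = []
    for d in digits:                               # step 4: delete repeated adjacent digits
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    return (letters[0].upper() + "".join(collapsed))[:4]  # steps 1 and 5


codes = {w: soundex(w) for w in
         ["machine", "masheen", "filament", "fixture", "decision", "disown"]}
```

The reduction in alphabet size and string length is visible in the output: machine and masheen collapse to the same three-character code, but so do decision and disown.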
Referring to Figure 3, one might restate the problem of
getting $S_j$ to approach $S'_j$ as that of inventing rules for the
derivation of a canonical form such that only information
which discriminates between the elements of $S'_j$ is discarded.
One characteristic which the elements of $S'_j$ are likely to
share is their phonetic interpretation. Masters
(1927) concluded that in his sample of 13,183 misspellings
of 278 words, 65% were "possible spellings from a phonetic
point of view". This suggests that, where $S'_j$ consists of a
word and its misspellings, a function is needed which
generates canonical forms which are representations of the
way words sound. This is the rationale behind many of the
functions discussed here.

Among  the  misspellings  which  can  confound  matching 
functions  relying  on  phonetic  analysis  techniques  are  those 
resulting  from  incorrect  or  alternate  pronunciations. 

Masters identified an additional 14% of his sample as
follows:

Misspellings which, though they cannot be pronounced
exactly like the correct form, are approximate
phonetic spellings; which are possible phonetic
spellings for common mispronunciations of the
words; or in which the necessary change in the
pronunciation of this form is so slightly different
from the exact pronunciation of the correct form
that it is scarcely detectable by the uncritical
ear.

A  matching  function  based  on  an  algorithm  which  generates  a 
precise  phonetic  representation  from  an  orthographic 
representation  may  fail  to  recognize  misspellings  falling 
into  this  latter  category. 
Hewes and Stowe (1965) point out that SOUNDEX attempts
to  solve  this  problem  by  grouping  together  letters  commonly 
representing  frequently  confused  phonemes  under  the  same 
code  element.  The  bilabial  stops  (/P/  and  /B/)  are  thrown 
together  with  the  labiodental  fricatives  (/F/  and  /V/).3 
Since  h  is  deleted,  the  alveolar  stops  (/T/  and  /D/)  exist 
in  the  same  group  as  the  dental  fricatives  (the  initial 
consonants  in  thigh  and  thy).  The  often  confused  nasal 
resonants (/M/, /N/, and /ING/) are identified by a single
code.

Because  they  are  rarely  confused  by  native  English 
speakers,  the  lateral  and  median  alveolar  resonants  (/L/  and 
/R/)  were  allowed  separate  code  elements.  The  velar  stops 
(/K/  and  /G/)  are  combined  with  the  affricated  alveopalatal 
stops  (/CH/  and  /J/)  and  the  groove  fricatives  (/S/,  /Z/, 
/SH/,  and  the  first  consonant  in  azure)  because  English 
orthography  fails  to  preserve  any  simple  consistent 
distinction  between  these  phonemes. 

Although  the  notorious  variability  of  vowel 
pronunciation may justify the mapping of all vowels to a
single code element, it would not seem to support the
elimination of that element from the final canonical form.
One can probably assume that the characters Y, H, and W are
mapped  to  the  vowel  code  element  because  they  rarely  signal 

3  Phonemes  are  here  indicated  by  the  lexeme  (in  upper  case) 
with  which  the  phoneme  is  usually  associated,  enclosed  by 
slashes.  In  cases  where  this  ad  hoc  system  fails,  meaning 
has  been  clarified  by  example.  See  Gleason  (1961)  for  an 
explanation  of  phoneme  nomenclature. 
a consonant except in the initial position -- which is not
transformed  in  any  case. 

If  the  vowel  code  element  is  to  be  deleted,  it  seems 
unreasonable  to  do  so  before  the  elimination  of  repeated 
adjacent  code  elements.  Hewes  and  Stowe  presented  a  method 
which  is  identical  to  that  given  in  Table  1  except  that 
steps  3  and  4  are  interchanged.  The  modified  version 
preserves  the  disjunction  of  identical  code  elements 
separated  only  by  vowels  and  will  thus  maintain  a 
distinction  between  words  like  "phases"  and  "packages". 
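The effect of interchanging steps 3 and 4 can be checked with a short Python sketch. The function soundex_code and its flag collapse_before_zero_removal are hypothetical names. The chart is the one in Table 1, and the sketch assumes, as before, that the retained first letter takes no part in the collapsing of repeated digits.

```python
DIGIT_FOR = {letter: digit
             for digit, letters in {
                 "0": "aeiouhwy", "1": "bfpv", "2": "cgjkqsxz",
                 "3": "dt", "4": "l", "5": "mn", "6": "r",
             }.items()
             for letter in letters}


def drop_zeros(digits):
    return [d for d in digits if d != "0"]


def collapse_repeats(digits):
    out = []
    for d in digits:
        if not out or out[-1] != d:
            out.append(d)
    return out


def soundex_code(word: str, collapse_before_zero_removal: bool = False) -> str:
    """Table 1 order by default; the Hewes and Stowe order when the flag is set."""
    letters = [c for c in word.lower() if c in DIGIT_FOR]
    if not letters:
        return ""
    digits = [DIGIT_FOR[c] for c in letters[1:]]
    if collapse_before_zero_removal:
        digits = drop_zeros(collapse_repeats(digits))   # steps 4 then 3
    else:
        digits = collapse_repeats(drop_zeros(digits))   # steps 3 then 4
    return (letters[0].upper() + "".join(digits))[:4]


original = (soundex_code("phases"), soundex_code("packages"))
modified = (soundex_code("phases", True), soundex_code("packages", True))
```

In the original order both words reduce to the same code, because removing the vowel digit first brings the repeated 2s together. In the modified order the vowel digit keeps them apart until after the collapsing step, and the two codes differ.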

The  PHONETIC  option  for  response  analysis  in  the  CAI 
language  PLANIT  (Feingold,  1966;  Butler  and  Frye,  1970)  is 
illustrated  in  Table  2.  The  PLANIT  algorithm,  hereafter 
referred  to  as  PLANIT,  is  based  on  the  modified  soundex 
method  of  Hewes  and  Stowe  with  two  notable  differences: 

1.  The  initial  character  is  not  exempted  from  the 
transformation  procedure,  but  the  transformed  character 
in  this  position  is  never  deleted. 

2. The 'H' and 'W' characters are mapped to a single code
element  separate  from  the  vowel  code  element. 

One  justification  implied  by  Hewes  and  Stowe  for  the 
preservation  of  the  initial  character  is  that,  where  the 
SOUNDEX  code  is  serving  as  a  key  in  an  information  retrieval 
system,  each  of  the  26  alphabetic  characters  can  indicate  an 
absolute  address  from  which  a  sequential  search  for  a 
matching  code  can  proceed.  Perhaps  a  more  significant 
argument  for  the  preservation  of  the  initial  character  is 
that the frequency of misspellings in the initial position
is  much  lower  than  in  the  rest  of  the  word. 

If  the  initial  character  is  transformed,  it  follows 
that  h,  y  and  w  should  no  longer  be  mapped  to  the  vowel  code 
element.  In  the  initial  position  of  certain  words  of  Old 
English  origin,  the  bilabial  median  resonant  preceded  by 
breath  (/HW/  as  in  wheat)  was  actually  represented  by  the 
digraph  hw  well  into  the  13th  century.4  After  the 
introduction of wh to represent this vocalization, the
orthography  was  further  obscured  by  the  disappearance  of  the 
preceding  /H/  in  most  modern  dialects,  and  by  the  adoption 
of  wh  to  represent  /H/  in  the  initial  position  of  words  such 
as  whom.  This  is  apparently  the  reason  in  PLANIT  for 
grouping  h  and  w  together  under  the  same  code  element.  There 
seems  to  be  no  justification  for  leaving  y  in  the  vowel 
group  rather  than  allowing  a  separate  code  element  which  is 
deleted  in  all  but  the  initial  position. 
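The PLANIT procedure of Table 2 can be sketched in Python as follows. The function name planit is hypothetical and the handling of non-alphabetic characters (they are ignored) is an assumption. The sketch follows the four steps of the table literally, so step 4 also removes a leading A, which is what the table's entry for acid implies.

```python
PLANIT_CHART = {
    "A": "aeiouy",
    "B": "bfpv",
    "C": "cgjkqsxz",
    "D": "dt",
    "H": "hw",
    "L": "l",
    "M": "mn",
    "R": "r",
}
CODE_FOR = {letter: code
            for code, letters in PLANIT_CHART.items()
            for letter in letters}


def planit(word: str) -> str:
    """Canonical form following the four steps of Table 2."""
    codes = [CODE_FOR[c] for c in word.lower() if c in CODE_FOR]     # step 1
    codes = [c for i, c in enumerate(codes) if i == 0 or c != "H"]   # step 2
    collapsed = []
    for c in codes:                                                  # step 3
        if not collapsed or collapsed[-1] != c:
            collapsed.append(c)
    return "".join(c for c in collapsed if c != "A")                 # step 4


planit_codes = {w: planit(w) for w in
                ["machine", "masheen", "bridge", "brige",
                 "disown", "decision", "acid", "quiet"]}
```

Because H is removed before the collapsing step and A after it, decision keeps both of its C elements while machine and masheen still coincide.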

One can recognize two fundamental weaknesses in SOUNDEX
and its descendants: the inability to parse lexemes
containing  more  than  one  character,  and  the  inflexible  way 
of  handling  ambiguity  which  requires  a  lexeme  to  be  mapped 
to  only  one  code  element.  Although  these  limitations  allow 
for  relatively  quick  execution,  it  is  debatable  whether 
modern  CAI  systems  can  afford  the  inaccuracy  they  impose. 

The  identification  of  common  digraphs  and  trigraphs 

through  a  parsing  mechanism  would  remedy  many  of  the 

4 Oxford English Dictionary. London: Oxford University
Press.
Table 2: The PLANIT Algorithm


Do the following operations on t and s independently
to generate their canonical forms:

1. Use this chart to replace the letters in the
   original string with the corresponding letters:

      A   a, e, i, o, u, y
      B   b, f, p, v
      C   c, g, j, k, q, s, x, z
      D   d, t
      H   h, w
      L   l
      M   m, n
      R   r

2. Except for the first character, delete all
   occurrences of H.

3. Delete all repeated adjacent characters.

4. Delete all occurrences of A.


Examples:               Counterexamples:

machine  => MCM         bridge   => BRDC
masheen  => MCM         brige    => BRC

disown   => DCM         acid     => CD
decision => DCCM        quiet    => CD

problems arising with the soundex method. For example, ng
can be identified correctly as /ING/, and dg can be
recognized as /J/, allowing a successful match of the names
Rodgers and Rogers. However, a problem would still be posed
by  lexemes  like  t,  which  commonly  represents  /T/  (as  in 
negative)  but  may  also  represent  /SH/  (as  in  negotiate). 

Some  method  is  clearly  needed  which  allows  several  different 
lexemes  to  be  associated  with  a  common  string  of  phonemes 
and  conversely,  several  different  phoneme  strings  to  be 
associated  with  a  common  lexeme. 
Symonds (1970) proposed an algorithm (Table 3),
hereafter referred to as SYMONDS, which later formed the
basis for the cp (compare phonetic) function in the National
Research Council's NATAL-74 (Westrom, 1974) and Honeywell's
NATAL II (Honeywell, 1981) authoring languages. Instead of
generating a single canonical form, SYMONDS multiplied the
current number of canonical forms by the number of alternate
phonemes associated with the current lexeme in the string.
This resulted in $\prod_{i=1}^{n} q_i$ canonical forms, where $q_i$ is the number
of alternate phonemes corresponding to the ith lexeme parsed
from the string and n is the number of these lexemes. If two
strings shared at least one canonical form, they were
considered equivalent. Although Symonds' method can be
viewed as an extrapolation from the simpler soundex-type
methods, the generation of multiple canonical forms destroys
the mutual exclusivity of equivalence classes so, strictly
speaking, the method cannot be considered an equivalence
function.
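The multiplication of canonical forms can be illustrated with a Python sketch. It is deliberately partial: the LEXEMES dictionary is a hypothetical subset of the chart in Table 3, chosen only to cover the words used below, and the special treatment of sci and sce is omitted. Passing an unlisted consonant through as its own upper-case letter is an assumption, as are the function names.

```python
from itertools import product

# Hypothetical subset of the Table 3 chart (lexeme -> alternate phonemes).
LEXEMES = {
    "dg": ["J", "G"], "ck": ["K"],
    "b": ["B"], "c": ["S", "K"], "d": ["D"], "g": ["J", "G"],
    "k": ["K"], "q": ["Q"], "r": ["R"], "t": ["T"], "w": ["W"], "y": ["Y"],
}
VOWELS = set("aeiou")
LONGEST = max(len(lexeme) for lexeme in LEXEMES)


def phoneme_slots(word: str):
    """Parse a word into one list of alternate phonemes per lexeme."""
    word = word.lower()
    slots, i = [], 0
    while i < len(word):
        ch = word[i]
        is_last = i == len(word) - 1
        if (ch in VOWELS                                 # vowels are skipped
                or (i > 0 and word[i - 1] == ch)         # doubled consonant
                or (ch == "y" and i != 0 and not is_last)  # interior y
                or (ch == "w" and is_last)):             # final w
            i += 1
            continue
        for size in range(LONGEST, 0, -1):               # longest lexeme first
            chunk = word[i:i + size]
            if chunk in LEXEMES:
                slots.append(LEXEMES[chunk])
                i += len(chunk)
                break
        else:
            slots.append([ch.upper()])                   # assumption: pass through
            i += 1
    return slots


def canonical_forms(word: str):
    """All forms: one choice of phoneme per lexeme, so prod(q_i) at most."""
    return {"".join(choice) for choice in product(*phoneme_slots(word))}


def equivalent(s: str, t: str) -> bool:
    return bool(canonical_forms(s) & canonical_forms(t))
```

With this subset, bridge and brige each yield two forms and share both, so they are judged equivalent. The words quay and kway yield QY and KWY respectively and are not matched, while cite and kite share the form KT and are matched.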

SYMONDS  appears  to  be  free  from  many  of  the  faults  and 
restrictions  of  the  soundex-like  functions;  albeit  at  the 
cost  of  significantly  greater  execution  time.  However, 
several  details  of  the  mapping  algorithm  are  questionable. 

No example is given justifying the phoneme /G/ as an
alternate for the digraph dg. One would think the digraph
should be mapped down to /J/ and /DG/ (as in Edgar).
Similarly, no example is given for the phoneme /SH/ as a
translation of sce or the phoneme /K/ as a translation of
Table 3: The SYMONDS Algorithm


Use the chart to independently build a series of
canonical forms for each of s and t under the
following constraints:

1. Vowel characters a, e, i, o, u are skipped over and
   are not represented in the canonical form
   (except the i in sci and the e in sce).

2. Adjacent identical consonants are skipped.

3. y is skipped over unless it appears in the first
   or last position.

4. w is ignored if it appears in the last position.

5. When two lexemes both match to the beginning of
   the unparsed string, use the lexeme with the
   most characters.

6. When the lexeme has more than one corresponding
   phoneme representation, produce enough copies of
   the current set of canonical forms so that each
   new phoneme representation can be appended to a
   copy of each form in the current set. All of the
   unique forms so produced now become the new
   current set.

If s and t have any canonical forms in common, they
are considered equivalent.


lexeme  phoneme  example       lexeme  phoneme  example

b       B        but           p       P        pen
ch      SH       machine       q       Q        queen
        C        chair         r       R        rat
ck      K        back          rh      R        rhubarb
cq      Q        acquire       sci     S        science
c       S        city                  SH       conscience
        K        can           sce     S        scene
dg      J        badge                 SH
        G                      sh      SH       she
d       D        dog           s       S        safe
f       F        fill                  Z        easy
gn      N        gnaw          tch     SH
ght     T        slaughter             C        latch
        FT       laughter              K
gh      F        laugh         tio     C        question
        G        ghost                 SH       nation
g       J        gem           th      TH       the
        G        gum           t       T        take
h       H        hat           v       V        van
j       J        joke          wh      W        when
kn      N        knot          w       W        way
k       K        keep          x       KS       vex
l       L        late                  GZ       exist
m       M        man           y       Y        yet
n       N        nod           z       Z        zoo
ph      F        phantom


Examples:                    Counterexamples:

bridge => BRJ, BRG           quay => QY
brige  => BRJ, BRG           kway => KWY

acid   => SD, KD             cite => KT, ST, CT
quiet  => QT                 kite => KT

tch. The existence of the 'phoneme' Q seems particularly
unreasonable because a mapping to /KW/ (as in queen) and /K/
(as in daiquiri) would serve better.

One major deficiency is the absence of a general
mechanism for incorporating positional information about a
lexeme into the mapping procedure. As with the digraph gh,
which never represents /F/ in the initial position and never
represents /G/ in the final position, knowing the position
of a lexeme can frequently remove ambiguity from its
translation.

SYMONDS  is  incapable  of  identifying  most  cases  of 
misspellings  resulting  from  incorrect  or  alternate 
pronunciation  (the  approximate  phonetic  misspellings 
described  by  Masters).  Misspellings  such  as  dem  for  them  and 
ting  for  thing  will  not  be  recognized. 

Problems  relating  to  the  retrieval  and  storage  of 
information  have  motivated  most  of  the  work  on  approximate 
string  matching  algorithms.  The  extent  to  which  misspellings 
can  hamper  the  searching  of  data  bases  was  documented  by 
Bourne  (1977)  who  found  the  frequency  of  index  term 
misspellings  in  a  sample  of  11  bibliographic  data  bases  to 
range from 0.5% in one data base to almost 23% in another.
While the recognition of equivalent or similar keywords has
been the goal most relevant to response analysis in CAI,
early  work  on  the  abbreviation  of  stored  English  words  in  an 
era  of  expensive  memory  deserves  some  mention  here  because 
it  can  be  viewed  as  a  related  problem  involving  the 
generation  of  canonical  forms. 

Bourne  and  Ford  (1961)  compared  several  methods  for  the 
abbreviation  of  English  words  and  names  according  to  the 
dual criteria of compactness and discriminability of the
abbreviated  form.  The  methods  considered  by  Bourne  and  Ford 
ranged  from  truncation  of  the  string  (from  either  end)  to 
procedures  where  some  mathematical  function  is  applied  to 
the  internal  numeric  representation  of  the  string.  Most  of 
the  methods  generated  a  canonical  form  by  eliminating 
characters  selected  according  to  a  set  of  simple  rules.  Only 
a  few  of  the  methods  described  are  reasonable  approaches  to 
the  approximate  string  matching  problem.  These  are  given  in 
Table  4. 

To  apply  the  last  three  methods  to  the  word  recognition 
problem,  rather  than  a  ranking  of  frequency  of  usage,  one 
requires  a  ranking  of  the  frequency  of  misusage  of 
characters.  A  method  for  the  recognition  of  misspellings 
proposed  by  Blair  (1960,  Table  5),  hereafter  referred  to  as 
BLAIR,  bears  some  similarity  to  the  third  algorithm  in  Table 
4,  but  was  modified  to  use  an  error  frequency  ranking.  Blair 
stated  that  his  coding  algorithm  based  on  error  frequency  is 
Table 4: Bourne & Ford's Abbreviation Algorithms


1. Elimination of Vowels

Starting  from  the  right  end  of  the  string  delete 
the  characters  a,e,i,o,u  until  the  required 
string  length  is  achieved  or  the  left  end  of  the 
string  is  reached.  In  the  latter  case,  truncate 
the  remaining  string  to  the  required  length. 

2.  Elimination  by  Character  Frequency 

Given  a  ranking  of  all  alphabetic  characters  by 
frequency  of  usage  obtained  from  a  sample  of 
words  similar  to  those  one  expects  to  operate 
on,  delete  the  most  common  characters  until  the 
required  length  is  achieved.  For  words  the 
ranking  was: 

EIROATNSLCPMDUHGYBFVKWXZJQ.

For  personal  names  the  ranking  was: 

EARNLOISTHDMCBGUWYJKPFVZXQ.

3.  Elimination  by  Positional  Character  Frequency 
Given  a  separate  frequency  ranking  of  all 
alphabetic  characters  for  each  character 
position,  keep  finding  and  eliminating  the 
highest  ranking  character  until  the  required 
length  is  achieved.  In  the  case  of  ties,  delete 
the  rightmost  character  first.  It  was  reported 
that  the  differences  between  the  rankings  for 
each  position  were  very  small  for  all  but  the 
first  three  positions,  which  each  differed 
markedly  from  the  other  rankings. 

4.  Elimination  by  Character  Bigram  Frequency 
Given  a  frequency  ranking  of  bigrams,  determine 
a  score  for  each  character  by  summing  the  ranks 
for  the  two  bigrams  to  which  the  character 
belongs.  Then  start  eliminating  the  characters 
with  the  highest  scores  until  the  required 
length  is  achieved.  Bigrams  may  include  spaces, 
but  the  initial  character  is  retained  regardless 
of  its  score. 


superior  to  one  using  simple  character  frequency,  but 
unfortunately  no  details  of  this  comparison  were  supplied. 
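For reference, the second method of Table 4 (elimination by character frequency) might be sketched in Python as follows. The function name abbreviate_by_frequency is hypothetical. Table 4 does not say which occurrence of a repeated character is deleted first, so deleting the rightmost occurrence is an assumption borrowed from the tie rule of the third method. Characters missing from the ranking are treated as the least common.

```python
WORD_RANKING = "EIROATNSLCPMDUHGYBFVKWXZJQ"  # most common first, for words


def abbreviate_by_frequency(word: str, length: int,
                            ranking: str = WORD_RANKING) -> str:
    """Delete the most common characters until `length` characters remain."""
    rank = {c: i for i, c in enumerate(ranking)}
    chars = list(word.upper())
    while len(chars) > length:
        # Lowest rank index = most common; among equals take the rightmost
        # (assumption, see the lead-in).
        victim = min(range(len(chars)),
                     key=lambda i: (rank.get(chars[i], len(ranking)), -i))
        del chars[victim]
    return "".join(chars)


short_machine = abbreviate_by_frequency("machine", 4)
short_letter = abbreviate_by_frequency("letter", 3)
```

Replacing the usage ranking with a ranking of how often each character is involved in spelling errors gives the kind of coding scheme discussed next.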

Although  the  nominal  and  position  scores  were  reported 
to  be  based  on  the  "frequency  of  their  occurrence  as 
errors",  there  is  no  explicit  description  of  how  they  were 
derived. But taking the quote literally it is apparent that
there  was  a  failure  to  include  the  frequency  of 
nonoccurrence  of  correct  characters  in  misspellings.  This  is 
important  because  not  only  misspellings  but  also  the  correct 
words  are  being  reduced  to  canonical  form.  One  possible  way 
of  generating  such  information  would  be  to  correct  a  large 
sample  of  misspelled  words  using  only  operations  of 
insertion  and  deletion.  With  each  operation  a  score 
associated  with  the  character  serving  as  the  operand  would 
be  incremented.  It  appears  that  Blair's  scores  would  be 
comparable  to  those  derived  were  one  to  only  count  the 
deletions  but  not  the  insertions. 
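The deletion-rating procedure of Table 5 can be sketched in Python as follows. The nominal and positional scores are transcribed from the table. The function names are hypothetical, and two details are assumptions: when a word has an odd number of letters the middle letter takes its score from list #1, and ties between deletion ratings are resolved by deleting the rightmost letter first.

```python
NOMINAL = dict(zip("ABCDEFGHIJKLMNOPQRSTUVWXYZ",
                   map(int, "51501125601513430453411021")))
FROM_START = [0, 2, 4, 5, 5, 6, 6, 6, 7, 7]  # list #1; 7 for later positions
FROM_END = [1, 3, 4, 5, 5, 6, 6, 7, 7, 7]    # list #2; 7 for later positions


def positional_scores(length: int):
    """Assign scores alternately from the front (#1) and the back (#2)."""
    scores = [0] * length
    left, right, k = 0, length - 1, 0
    while left <= right:
        scores[left] = FROM_START[min(k, 9)]
        if right != left:          # assumption: middle letter uses list #1
            scores[right] = FROM_END[min(k, 9)]
        left, right, k = left + 1, right - 1, k + 1
    return scores


def blair_abbreviate(word: str, n: int) -> str:
    """Delete the n letters with the highest deletion ratings."""
    letters = word.upper()  # expects alphabetic input only
    ratings = [NOMINAL[c] + p
               for c, p in zip(letters, positional_scores(len(letters)))]
    # Highest rating first; rightmost first among ties (assumption).
    order = sorted(range(len(letters)), key=lambda i: (-ratings[i], -i))
    doomed = set(order[:n])
    return "".join(c for i, c in enumerate(letters) if i not in doomed)


machine_code = blair_abbreviate("machine", 3)
```

For machine the ratings are 1, 7, 9, 10, 10, 6, and 2, so the three letters removed are C, H, and I. Letters near the middle of the word are the most likely to be deleted, which reflects the positional scores rather than the letters themselves.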

Davidson  (1962)  used  a  coding  method  which  was 
successfully  employed  in  the  retrieval  of  passenger  names 
from  an  airline's  record  system.  It  reduced  each  passenger's 
name  to  a  five  character  code  by  means  of  vowel  deletion, 
deletion  of  repeated  adjacent  characters,  and  truncation. 

The  inclusion  of  the  first  initial  as  the  fifth  character  of 
the  code  raises  the  interesting  possibility  of  special 
coding  techniques  for  certain  classes  of  words. 

One  expects  classes  of  words  or  phrases  such  as 
personal  names,  addresses,  adjectives,  various  forms  of 
specialized  jargon,  and  so  on,  to  have  different  equivalence 
relations.  If  Lawrence  Street  refers  to  a  place,  then 
Lawrence  St.  is  equivalent  but  L.  Street  probably  is  not. 

But the converse is true if a person is being referenced. If
separate approximate string matching functions for
distinguishable vocabularies are shown to sufficiently
optimize the recognition process, it may be desirable to
provide the CAI author with a toolkit of specialized
functions. A further step would be to enable the author to
construct a function to suit the job at hand, either from
scratch or by modifying existing functions.

Table 5: The BLAIR Algorithm

Use the following paired list to obtain a nominal
score for each letter in the word:

letter  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
score   5 1 5 0 1 1 2 5 6 0 1 5 1 3 4 3 0 4 5 3 4 1 1 0 2 1

Next use the following paired lists to obtain a
positional score for each letter in the word. Select
a score for the first character from list #1, then
select a score for the last character from list #2,
then score the second character from #1 and the
penultimate character from #2, and so on until every
character has been assigned one score:

#1

position  1 2 3 4 5 6 7 8 9 10...
score     0 2 4 5 5 6 6 6 7 7...

#2

position  1 2 3 4 5 6 7 8 9 10...
score     1 3 4 5 5 6 6 7 7 7...

Finally, find the deletion rating of each letter by
taking the sum of its nominal and positional scores
and then delete the letters with the n highest
deletion ratings.

In his monograph Information Retrieval and the
Computer, Paice (1977) considered the recognition of
abbreviations,  synonyms,  alternate  correct  spellings,  and 
word  roots.  Of  course,  the  absence  of  rules  relating 
different  strings  with  similar  meanings  requires  methods  for 
the recognition of synonyms to rely on something like a
thesaurus. The prospects for the recognition of
abbreviations  and  contractions  are  not  much  better.  Paice 
presented  four  rules  for  the  generation  of  a  canonical  form 
useful  in  the  recognition  of  alternate  correct  spellings 
(Table  6).  A  rule  which  deleted  consonants  occurring 
internally  would  probably  be  an  effective  addition  to  this 
set  since  British  spellings  frequently  have  these  where 
American  spellings  do  not. 

A  CAI  author  may  be  willing  to  accept  a  word  from  the 
student  response  having  the  same  root  as  his  target  word. 
Paice's "conflation" algorithm (Table 7) attempts to remove
suffixes.  Paice  points  out  that  finding  the  correct  root  is 
not  necessary  as  long  as  "(i)  members  of  a  family  reduce  to 
the  same  root,  and  (ii)  members  of  different  families  reduce 
to  different  roots".  He  also  notes  that  procedures  for 
removing  prefixes  are  less  useful  because  the  result  of  such 
an  operation  is  frequently  a  word  with  an  opposite  or 
radically  different  meaning.  The  algorithm  in  Table  7  is 
executed  by  beginning  at  the  label  START  and  checking  the 
ending  given  in  the  table  against  the  ending  of  the  string 
being  conflated.  If  the  ending  matches,  replace  it  with  the 
replacement  given  and  follow  the  corresponding  transfer 
instruction.  Otherwise,  step  to  the  next  line  and  reiterate 
the  procedure.  The  dash  represents  that  substring  which 
encompasses  all  characters  excluding  the  suffix. 
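
The table-driven procedure just described can be sketched in Python. This is only an illustrative interpreter, not Paice's own code: the rule rows are transcribed from Table 7 as well as the scan allows, the label that the scan prints as ABL is assumed to be the target of the "goto ABLE" transfers, and the names RULES and conflate are hypothetical. A row with an empty ending stands for the bare dash, which matches any string.

```python
# Each row: (label or None, ending, replacement, transfer).
# transfer is either "stop" or the label of the row to jump to.
# Rows are an assumed reading of Table 7; "" plays the role of the bare dash.
RULES = [
    ("START", "ably", "", "IS"),
    (None, "ibly", "", "stop"),
    (None, "ly", "", "SS"),
    ("SS", "ss", "ss", "stop"),
    (None, "ous", "", "stop"),
    (None, "ies", "y", "ARY"),
    (None, "s", "", "E"),
    (None, "ied", "y", "ARY"),
    (None, "ed", "", "ABLE"),
    (None, "ing", "", "ABLE"),
    ("E", "e", "", "ABLE"),
    (None, "al", "", "ION"),
    ("ION", "ion", "", "AT"),
    (None, "", "", "stop"),
    ("ARY", "ary", "", "stop"),
    (None, "ability", "", "IS"),
    (None, "ibility", "", "stop"),
    (None, "ity", "", "IV"),
    (None, "ify", "", "stop"),
    (None, "", "", "stop"),
    ("ABLE", "abl", "", "IS"),
    (None, "ibl", "", "stop"),
    ("IV", "iv", "", "AT"),
    ("AT", "at", "", "IS"),
    ("IS", "is", "", "stop"),
    (None, "ific", "", "stop"),
    (None, "olv", "olut", "stop"),
    (None, "", "", "stop"),
]

# Map each label to the index of the row that carries it.
LABELS = {row[0]: idx for idx, row in enumerate(RULES) if row[0]}


def conflate(word):
    """Reduce a word to a root by following the rule table from START."""
    pos = LABELS["START"]
    while True:
        _label, ending, replacement, transfer = RULES[pos]
        if word.endswith(ending):
            # Swap the matched ending for its replacement.
            word = word[: len(word) - len(ending)] + replacement
            if transfer == "stop":
                return word
            pos = LABELS[transfer]
        else:
            pos += 1  # no match: step to the next line of the table


for w in ("creation", "creative", "created", "resolve", "resolution"):
    print(w, "->", conflate(w))
```

Every transfer in this reading of the table points forward, and the final row matches any string, so the loop always terminates. The roots produced need not be real words; as Paice notes, it is enough that members of one family reduce to the same string.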

Table 6: Paice's Alternate Spelling Conversion Rules

(- represents some substring)

1. Change z to s using the rule

   -VzV- => -VsV-

   where V is a, e, i, o, u or y.

   Examples: razor, analyze, realize
   Counterexamples: hazard, squeeze

2. Change ph to f using the rule

   -ph- => -f-

   Examples: sulphur, peripheral, symphony
   Counterexamples: uphill, haphazard

3. If the original word has a length > 5 then apply
   the rule

   -our => -or

   Examples: flavour, humour, four
   Counterexample: devour

4. After removing endings such as -e, -ate, -ation,
   apply the rule

   -tr => -ter

   Examples: centr(e), filtr(ate), titr(ation)

The applicability of the methods considered here to
response analysis in languages other than English will
probably vary with the similarity of the language to
English. Fendt (1974) documented the use of the PLANIT
algorithm in a German CAI system consisting of a collection
of APL functions. On the other hand, in languages with a
systematic orthography, such as the Japanese hiragana and
katakana, the need for approximate string matching functions
may be negligible.

One  way  of  summarizing  and  classifying  the  equivalence 
functions  reviewed  is  by  viewing  each  edit  operation 
contributing  to  the  conversion  of  a  word  to  a  canonical 
form,  as  a  substitution  operation  which  replaces  a  substring 
of  length  m  with  a  substring  of  length  n.  When  m=0  and  n>0, 
an insertion occurs. When m>0 and n=0, a deletion occurs.

Table 7: Paice's Conflation Algorithm

label   ending     replacement   transfer

START   -ably      -             goto IS
        -ibly      -             stop
        -ly        -             goto SS
SS      -ss        -ss           stop
        -ous       -             stop
        -ies       -y            goto ARY
        -s         -             goto E
        -ied       -y            goto ARY
        -ed        -             goto ABLE
        -ing       -             goto ABLE
E       -e         -             goto ABLE
        -al        -             goto ION
ION     -ion       -             goto AT
        -          -             stop
ARY     -ary       -             stop
        -ability   -             goto IS
        -ibility   -             stop
        -ity       -             goto IV
        -ify       -             stop
        -          -             stop
ABLE    -abl       -             goto IS
        -ibl       -             stop
IV      -iv        -             goto AT
AT      -at        -             goto IS
IS      -is        -             stop
        -ific      -             stop
        -olv       -olut         stop
        -          -             stop

The  simplest  coding  procedures  represented  by  the 
methods  of  Davidson,  Blair,  and  those  of  Bourne  and  Ford 
given  in  Table  4,  perform  only  single  character  deletion 
(m=1,n=0).  Two  simple  rules  seem  to  pervade  these  and  other 
methods  using  character  deletion:  delete  vowels  and  delete 
adjacent repeated characters. While the latter rule is
absent from the coding procedures of Blair, and Bourne
and Ford, the effect of the delete vowels rule is achieved
by assigning most vowels high weights to increase the
probability of their deletion.

The  conclusion  that  these  two  rules  together  constitute 
a  tolerably  successful  method  of  generating  canonical  forms 
is  supported  by  the  experience  of  workers  at  the  University 
of  Alberta.  One  extension  to  Coursewriter  II  (Romaniuk  and 
Schienbein,  1973)  developed  by  Peuchot,  which  also  appeared 
later  in  his  IMOGENE  instruction  module  generator 
(Peuchot, 1975), allowed authors to enter a 'skeleton' of a
student  response  they  were  expecting.  The  skeleton  word  was 
actually  a  canonical  form  which  successfully  matched  with 
any  word  containing  the  characters  of  the  skeleton  in 
correct  order  but  ignoring  absolute  character  position. 
Authors  usually  generated  the  skeleton  by  simply  removing 
all  vowels  and  repeated  consecutive  consonants. 
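
As an illustration only, the skeleton idea can be sketched in Python. The function names are hypothetical, the vowel set is assumed to be a, e, i, o, u, and "repeated consecutive" is read as repeats that are adjacent in the original word; none of this is taken from the Coursewriter II or IMOGENE implementations.

```python
VOWELS = set("aeiou")


def make_skeleton(word):
    """Drop vowels and any character that repeats its immediate predecessor."""
    out = []
    prev = None
    for ch in word.lower():
        if ch != prev and ch not in VOWELS:
            out.append(ch)
        prev = ch
    return "".join(out)


def matches_skeleton(skeleton, response):
    """True if the skeleton's characters occur in the response in the same order,
    regardless of their absolute positions."""
    remaining = iter(response.lower())
    # "ch in remaining" consumes the iterator up to the first match,
    # so later skeleton characters must be found further to the right.
    return all(ch in remaining for ch in skeleton)


skeleton = make_skeleton("committee")
print(skeleton)
print(matches_skeleton(skeleton, "comitee"))
print(matches_skeleton(skeleton, "comet"))
```

The last call shows the price of the method: an unrelated word such as comet also contains the skeleton characters in order, so it is accepted as well.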

Slightly more complex equivalence functions which allow
substitution with m=1 and n=1 are represented by the
soundex-like methods. For these methods, an attractive
alternative to deleting the vowels altogether is to map them
all to a single character before applying the delete
adjacent repeated characters rule. If words tend to be
misspelled by the substitution of vowels for other vowels,
then the reduction in type II errors would outweigh the
increase in type I errors. Under the proposed rule, brake
would not be confused with bark. However, if words tend to
be misspelled by the deletion of vowels, then the increase
in type I error would probably be too high. For example,
brak would not be judged equivalent to brake. This problem
might be obviated by deleting any final e before
transforming the vowels.
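
The proposed rule can be tried out with a few lines of Python. This is a sketch under stated assumptions: the vowels are taken to be a, e, i, o, u, the placeholder character "V" and the function name canonical are arbitrary choices, and the optional removal of a final e follows the suggestion in the paragraph above.

```python
import re


def canonical(word, drop_final_e=True):
    """Map every vowel to one placeholder, then collapse adjacent repeats."""
    w = word.lower()
    if drop_final_e and w.endswith("e"):
        w = w[:-1]
    w = re.sub(r"[aeiou]", "V", w)   # all vowels become the same character
    w = re.sub(r"(.)\1+", r"\1", w)  # delete adjacent repeated characters
    return w


print(canonical("brake"), canonical("brak"), canonical("bark"))
print(canonical("brake", drop_final_e=False), canonical("brak", drop_final_e=False))
```

With the final e removed, brake and brak share a canonical form while bark stays distinct; without that step, brake and brak are separated, which is the type I error discussed above.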

The  most  complex  transformation  procedures  according  to 
this  classification  scheme  allow  substitution  with  values  of 
n  and  m  exceeding  one.  The  functions  falling  into  this 
category,  SYMONDS  and  Paice's  alternate  spelling  and 
conflation routines, all require a parsing of the original
string.

String  Similarity  Functions 

Recall that, given strings t and s, a similarity
function returns a value r_ts which is a measure of the
proximity of t and s. Hall and Dowling noted that the
similarity relation can be used either to find all strings
t1, t2, t3, ..., tn such that r_ts is "greater than some
threshold of acceptability" (previously called R), or to
find "the N strings t1, t2, t3 ... tn such that their [r_ts] have
the N largest values". These formulations reflect procedures
associated with the retrieval of information where an index
term entered by a user is compared against index terms
linked with the stored data. Although similar procedures may
be used in an instructional program to query the student
about the intended response by displaying all target words
meeting some threshold, consideration is given here only to
the comparison of s and t such that a binary decision r_ts is

Hall and Dowling observed that while the range of r is
arbitrary,  "the  value  of  +1.0  for  an  exact  match  seems  to 
have  strong  intuitive  appeal,  and  the  range  of  values  from 
-1.0  to  +1.0  appears  to  gain  respectability  from  correlation 
coefficients  and  normalized  inner  products".  The  range  0.0 
to  1.0  has  been  more  commonly  used  perhaps  because  the 
similarity  between  two  strings,  by  analogy  with  physical 
distance, is heuristically an unsigned magnitude for most
functions.

The  similarity  relation  can  be  viewed  as  more 
appropriate  than  the  equivalence  relation  for  the  problem  of 
approximately  matching  English  words  due  to  the  intolerance 
of  the  latter  to  ambiguity.  Mutually  exclusive  equivalence 
classes  cannot  represent  the  frequent  occasion  where  a 
feature  of  the  word  is  similar  to  two  or  more  other  features 
which  are  themselves  dissimilar.  For  example,  in  an 
orthographic  sense,  c  is  similar  to  both  s  and  k;  but  s  and 
k  are  dissimilar.  This  was  seen  to  be  the  case  with  SYMONDS, 
which  needed  to  destroy  the  mutual  exclusivity  of 
equivalence  classes  in  order  to  accurately  interpret  the 
relationship  between  lexemes. 

The  earliest  publication  describing  the  string 
similarity  relation  is  apparently  that  of  Glantz  (1957).  He 
proposed  a  function  which  simply  padded  the  shorter  string 
with  blanks  and  tested  the  two  characters  at  each  position 
in  the  strings  for  equality.  The  number  of  mismatched 
characters was divided by the string length to yield a value
of r which ranged from 0.0 to 1.0.
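
A literal reading of this description can be sketched in Python. The function name is hypothetical, and the divisor is assumed to be the padded length (the length of the longer string). Taken literally, the value is a mismatch ratio, so 0.0 corresponds to identical strings.

```python
def glantz_mismatch(a, b):
    """Pad the shorter string with blanks and return the fraction of
    positions at which the two strings differ."""
    length = max(len(a), len(b))
    if length == 0:
        return 0.0
    a = a.ljust(length)
    b = b.ljust(length)
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / length


print(glantz_mismatch("circle", "curcal"))
print(glantz_mismatch("november", "noember"))
```

The second call illustrates the weakness of a purely positional comparison: one deleted character shifts every later character out of alignment, so a single error is scored as six mismatches out of eight positions.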

Damerau  (1964)  observed  that  more  than  80%  of 
keypunching  errors  were  single  instances  of  either 
insertion,  deletion,  substitution,  or  adjacent  transposition 
of  characters  in  a  word.  He  suggested  that  "these  are  the 
errors  one  would  expect  as  a  result  of  misreading,  hitting  a 
key  twice,  or  letting  the  eye  move  faster  than  the  hand".  An 
algorithm  (Table  8)  was  proposed  whereby  two  strings  (A  and 
B)  are  judged  to  be  similar  if  one  could  be  converted  to  the 
other  by  any  one  of  these  operations.  DAMERAU  has  been  used 
successfully  to  detect  misspelling  of  keywords  in  compilers 
(Morgan, 1970) and operating systems such as MTS.
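
The single-error test can be restated compactly in Python. This is a sketch of the idea rather than a transcription of Table 8; the function name damerau_similar is hypothetical, and identical strings are treated as similar.

```python
def damerau_similar(a, b):
    """True if a and b are identical or differ by exactly one insertion,
    deletion, substitution, or transposition of adjacent characters."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diffs) == 1:
            return True  # one substitution
        if len(diffs) == 2:
            i, j = diffs
            # adjacent positions whose characters are swapped
            return j == i + 1 and a[i] == b[j] and a[j] == b[i]
        return False
    # Lengths differ by one: delete the first mismatching character
    # from the longer string and require an exact match afterwards.
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    i = 0
    while i < len(shorter) and longer[i] == shorter[i]:
        i += 1
    return longer[:i] + longer[i + 1:] == shorter


print(damerau_similar("november", "noember"))
print(damerau_similar("circle", "cirlce"))
print(damerau_similar("circle", "curcal"))
```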

Faulk  (1964)  defined  three  measures  of  string 
similarity:  material,  ordinal,  and  positional.  All  three 
functions,  and  also  a  superordinate  function  which  contains 
them,  return  values  ranging  from  0.0  to  1.0.  Given  the 
strings  A  and  B,  having  lengths  m  and  n,  a  matching  pair  of 
characters  is  represented  by  the  coordinates  (i,j).  K  is  the 
complete  set  of  d  matching  pairs  between  A  and  B.  q  is  the 
number  of  pairings  of  matching  pairs  in  which  both  the  i  and 
j  coordinates  of  one  matching  pair  are  greater  than  those  of 
the other matching pair. q has a maximum value of $d^2-d$. To
illustrate, for the strings abcd and adadec, K becomes
(1,1), (4,2), (1,3), (4,4), (3,6) and m=4, n=6, d=5, q=10.
However, for comparisons involving "redundant" strings,
those which contain more than one instance of each character
as is the case with the above example, K is not useful.
Instead, Faulk defined K' to be the set of optimal matched
pairs. That is to say, given that only a one-to-one
correspondence of elements is allowed, the matched pairs
(i,j) are chosen such that the sum of $(i-j)^2$ over all d
matched pairs is minimal. With this definition, d can attain
a maximal value of MIN(m,n). The example comparison above
produces K' = (1,1), (3,6), (4,4) and d=3, q=4.

Table 8: The DAMERAU Algorithm In Pseudo-Pascal

var m, n, i, diff : integer;
    errorcount, firsterror, lasterror : integer;

{ compare A and B over the first "length" positions and record }
{ the number of mismatches and the first and last mismatch     }
procedure match(length);
begin
    errorcount := 0;
    firsterror := 0;
    lasterror := 0;
    for i := 1 to length do
    begin
        if A[i] <> B[i] then
        begin
            errorcount := errorcount + 1;
            if errorcount = 1 then firsterror := i
            else lasterror := i;
        end;
    end;
end;

begin
    m := length(A);
    n := length(B);
    diff := m - n;
    if (diff < -1) or (diff > 1) then print "dissimilar"
    else case diff of
        { equal lengths: identity, one substitution, or one transposition }
        0: begin
            match(m);
            if errorcount < 3 then
                case errorcount of
                    0,1: print "similar";
                    2: if (firsterror = lasterror-1)
                          and (A[firsterror] = B[lasterror])
                          and (B[firsterror] = A[lasterror])
                       then print "similar"
                       else print "dissimilar";
                endcase
            else print "dissimilar";
        end;
        { B is one character longer than A }
        -1: begin
            match(m);
            if errorcount = 0 then print "similar"
            else begin
                delete(B[firsterror]);
                match(m);
                if errorcount = 0 then print "similar"
                else print "dissimilar";
            end;
        end;
        { A is one character longer than B }
        1: begin
            match(n);
            if errorcount = 0 then print "similar"
            else begin
                delete(A[firsterror]);
                match(n);
                if errorcount = 0 then print "similar"
                else print "dissimilar";
            end;
        end;
    endcase;
end.

Now the three measures of similarity can be given as
follows:

1. Material similarity = $2d/(m+n)$.

2. Ordinal similarity = $q/((m^2-m)/2 + (n^2-n)/2)$.

3. Positional similarity = $2(r_{max}-r)/((m^3-m)/3 + (n^3-n)/3)$.
Letting x be an index to the elements of K, r is the
total amount of positional disparity

$$r = \sum_{x=0}^{d-1} (i_x - j_x)^2$$

and $r_{max}$ is the maximum value r can attain for a given
value of d assuming nonredundant strings:

$$r_{max} = \sum_{x=0}^{d-1} (\mathrm{MAX}(m,n) - 2x - 1)^2$$

When d=m=n, $r_{max}$ attains a maximum value of $(m^3-m)/3$
which is the basis for the normalizing divisor.

Finally a Total Similarity Function is given as:

(material + ordinal + positional)/3

which raises the question of optimal weightings for the
subordinate functions.
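
The three measures can be checked numerically with a short Python sketch. It takes the set of matched pairs as given (it does not search for the optimal set K'), assumes both strings are longer than one character so that the divisors are nonzero, and uses hypothetical names throughout. Following the worked example, q counts each qualifying pairing twice, which is what gives it the stated maximum of d^2 - d.

```python
def faulk_measures(pairs, m, n):
    """Material, ordinal, positional and total similarity for a given set
    of matched pairs (i, j), with 1-based positions in strings of length m and n."""
    d = len(pairs)
    # Ordered pairings whose i and j coordinates differ in the same direction.
    q = sum(
        1
        for (i1, j1) in pairs
        for (i2, j2) in pairs
        if (i1 - i2) * (j1 - j2) > 0
    )
    r = sum((i - j) ** 2 for i, j in pairs)
    r_max = sum((max(m, n) - 2 * x - 1) ** 2 for x in range(d))
    material = 2 * d / (m + n)
    ordinal = q / ((m * m - m) / 2 + (n * n - n) / 2)
    positional = 2 * (r_max - r) / ((m ** 3 - m) / 3 + (n ** 3 - n) / 3)
    return {
        "material": material,
        "ordinal": ordinal,
        "positional": positional,
        "total": (material + ordinal + positional) / 3,
    }


# K' for abcd vs. adadec, as given in the text
print(faulk_measures([(1, 1), (3, 6), (4, 4)], 4, 6))
```

For this example d=3, q=4, r=9 and r_max=35, giving a material similarity of 0.6, an ordinal similarity of 4/21 and a positional similarity of 52/90. Two identical four-character strings score 1.0 on all three measures.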

Alberga  (1967)  reported  on  and  tested  approximate 
string  matching  functions  developed  by  himself  and  others  at 
the  IBM  Watson  Research  Center.  Alberga  chose  to  express  the 
algorithms  he  presented  as  operations  on  a  binary 
coincidence  matrix  E  generated  from  two  strings.  E  has  the 
order  m  x  n  corresponding  to  the  lengths  of  the  two  strings. 
E_ij is either 1 or 0 depending on the equality of the ith
character  in  the  first  string  with  the  jth  character  in  the 
second  string. 
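
For concreteness, a coincidence matrix of this kind can be built in one line of Python. The function name is hypothetical, and the rows are assumed to follow the first string and the columns the second.

```python
def coincidence_matrix(s, t):
    """E[i][j] is 1 when the ith character of s equals the jth character of t."""
    return [[1 if a == b else 0 for b in t] for a in s]


for row in coincidence_matrix("circle", "curcal"):
    print(row)
```

Note that the indices here are 0-based, whereas the text counts rows and columns from 1.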

A  general  approach  was  proposed  which  decomposed  the 
algorithms  into  three  phases  of  operations  on  coincidence 
matrices:  weighting,  selection,  and  similarity  operations. 
Several  complete  similarity  functions  (Alberga  tested  65) 
can  be  constructed  by  choosing  one  operation  for  each  phase 
(Table  9).  The  weighting  and  selection  phases  are  optional. 

Weighting  operations  assign  to  each  element  of  the 
matrix  a  value  related  to  the  probability  that  one  of  the 
corresponding  characters  is  derived  from  the  other.  ROOF  is 
based  on  the  principle  that  the  matrix  elements  of  derived 
pairs  tend  to  cluster  around  the  axis.  The  axis  is  defined 
as the 0th diagonal, where the Kth diagonal is any subset of
elements such that i-j equals some constant K.

Figure 4: Matrix Augmentation in CONTEX

CONTEX  is  based  on  the  reasonable  notion  that  the 
probability  that  a  character  match  is  not  spurious  is 
dependent  on  whether  the  surrounding  characters  match  as 
well.  The  augmentation  of  the  matrix  is  the  same  as  adding 
two  delimiters  to  both  ends  of  the  string  which  match  each 
other  but  none  of  the  internal  characters.  The  weighting 
factor  for  CONTEX  given  in  Table  9  is  a  post  hoc  replacement 
suggested  by  Alberga  for  an  apparently  erroneous  factor  he 
used  in  his  comparative  study. 

The result of the selection phase is generally an
m  x  n  matrix  having  no  more  than  one  nonzero  element  in  each 
row  and  column.  Note  that  the  SFIRST,  SORDER,  and  LSTNG 
operations  are  unaffected  by  any  weighting  operations  which 
precede them. The asymmetry which characterizes SFIRST,
SORDER, and SBYC is particularly detrimental in the case of
SFIRST, which will be scuttled by an insertion appearing
near the beginning of the vertical string or a deletion near
the beginning of the horizontal string.

Table 9: Alberga's Matching Operations

Weighting Phase

1. ROOF

   Multiply every i,jth element by $1-D_{ij}$ where $D_{ij}$
   is the distance between that element and the
   axis:

   $D_{ij} = |(i-1)/(m-1) - (j-1)/(n-1)|$

2. CONTEX

   Augment the matrix by adding four rows and four
   columns as in Figure 4. Apply the following
   weighting factor to each element $E_{ij}$:

   $\frac{1}{14}(1 + 3(E_{i+1,j+1} + E_{i-1,j-1}) + 2(E_{i+2,j+2} + E_{i-2,j-2}) + E_{i+1,j+1}E_{i+2,j+2} + E_{i-1,j-1}E_{i-2,j-2} + E_{i+1,j+1}E_{i-1,j-1})$

Selection Phase

1. SFIRST

   Starting with the top row, search each row from
   left to right and find the first nonzero element
   $E_{ij}$. Zero all the other elements in that row.
   Start the search in the next row at the (j+1)st
   column. Halt the operation and zero any
   remaining rows when either i=m or j=n.

2. SORDER

   This operation is the same as SFIRST except that
   when no nonzero elements can be found in a row,
   the search is resumed at the (i+2,j+2)th
   element.

3. SBYC

   Starting at the top row, find the largest
   element in each row not occupying a columnar
   position held by a previously selected element.
   Zero all other elements in that row.

4. SMAX

   Find the largest element in the matrix. Zero all
   other elements occupying the same row and
   column. Continue this procedure by finding the
   next largest elements until all rows and columns
   have been processed.

5. LSTNG

   Examine each diagonal to find the largest set of
   consecutive nonzero elements. Zero all other
   elements in the rows and columns occupied by
   that set. Continue this procedure by finding the
   next largest sets until all nonzero elements are
   accounted for as members of some set.

6. MAXMON

   Set to zero all elements except those
   constituting a path or vector $E_{i_1 j_1} \ldots E_{i_d j_d}$
   whose elements have a maximum sum under the
   constraint that $i_d < i_{d+1}$ and $j_d < j_{d+1}$.

Similarity Phase

1. SUM

   Find the sum of the elements in the matrix.
   normalizing divisor: MAX(m,n).

2. DBL

   Find the sum of the elements in each diagonal.
   Multiply each sum by the length of the
   respective diagonal. Then obtain the sum of
   these products.
   normalizing divisor: $\mathrm{MAX}(m,n)^2$.

3. PAIRS

   Calculate the sum of the products of all pairs
   of diagonally adjacent elements.
   normalizing divisor: MAX(m,n)-1.

4. STRING

   For every set of k diagonally adjacent elements,
   multiply the first member (that having the
   lowest indices) by k, the second member by k-1
   and so on until the last member is multiplied by
   1. Then sum all the elements in the matrix.
   normalizing divisor: $\frac{1}{2}(\mathrm{MAX}(m,n)^2 - \mathrm{MAX}(m,n))$.

5. TRNSP

   For selected matrices only. Collapse the matrix
   by deleting all rows and columns containing only
   zeros. Let k be the number of row and column
   interchanges necessary to produce an identity
   matrix, or a matrix having all the nonzero
   elements in the axis. Calculate
   $1 - k/(\mathrm{MAX}(m,n) - 1)$.

MAXMON resolves situations where more than one path is
maximal by applying the selection scheme illustrated in
Figure 5. In this scheme the preferred direction at any
point in a path is determined by the current location in the
matrix. In regions close to the axis, diagonal movement is
most heavily weighted; and in outlying regions, movement
toward the axis receives the greatest weight.

Figure 5: Resolving Directional Preference in MAXMON

The  LSTNG  operation  is  appealing,  not  only  because  it 
does  not  require  a  weighting  phase,  but  also  because  it  can 
be  simply  described  in  English  as  the  procedure  of  first 
finding  the  longest  matching  substring  and  continuing  to 
find  the  longest  matching  substrings  in  the  remaining 
positions. 
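
The plain-English description of LSTNG lends itself to a short Python sketch. Here it is combined with the SUM operation and its MAX(m,n) divisor to give one complete similarity function. This is an illustration, not Alberga's code: the function name is hypothetical, ties between equally long substrings are broken in favour of the one found first, and empty strings are arbitrarily scored 0.0.

```python
def lstng_sum_similarity(s, t):
    """Repeatedly select the longest matching substring among the positions
    not yet used, then divide the number of selected characters by MAX(m,n)."""
    m, n = len(s), len(t)
    if m == 0 or n == 0:
        return 0.0  # assumption: nothing to match against
    row_free = [True] * m  # rows (positions of s) not yet selected
    col_free = [True] * n  # columns (positions of t) not yet selected
    selected = 0
    while True:
        best_len, best_i, best_j = 0, 0, 0
        for i in range(m):
            for j in range(n):
                # Length of the run of matches down the diagonal from (i, j).
                k = 0
                while (i + k < m and j + k < n
                       and row_free[i + k] and col_free[j + k]
                       and s[i + k] == t[j + k]):
                    k += 1
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no unused matching positions remain
        for k in range(best_len):
            row_free[best_i + k] = False
            col_free[best_j + k] = False
        selected += best_len
    return selected / max(m, n)


print(lstng_sum_similarity("november", "noember"))
```

For november and noember the first substring selected is ember and the second is no, so seven of the eight positions are accounted for.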

The  final  phase  of  operation  on  the  matrix  returns  a 
scalar  which  is  r,  the  measure  of  similarity  between  the  two 
strings.  In  these  calculations,  Alberga  applied  a 
normalizing factor to produce a value which ranged from 0.0
to 1.0. The normalizing divisors given in Table 9 apply only
to matrices having no more than one nonzero element in each
row and column (selected matrices). Appropriate normalizing
factors could not be found for unselected matrices. Instead,
pre-normalized values were divided by the mean of
pre-normalized similarity measures of each word with itself.

Szanser  developed  a  string  similarity  technique  which 
he  has  named  elastic  matching  and  has  discussed  in  several 
documents  (Szanser,  1969,  1971,  1973a,  1973b).  Since  it 
matches  strings  differing  by  one  instance  of  character 
insertion,  deletion,  substitution,  or  adjacent 
transposition,  it  really  amounts  to  being  a  rather  speedy 
implementation  of  Damerau's  algorithm. 

The  fundamental  procedure  in  elastic  matching  is  to 
first  independently  break  both  words  into  multiple  "lines" 
as  in  Figure  6  where  the  word  november  is  used.  Each  line 
has  a  fixed  length  equal  to  the  size  of  the  alphabet  being 
used.  Each  position  within  a  line  corresponds  to  a  member  of 
that  alphabet  and  can  assume  a  boolean  value.  The  particular 
character  that  a  position  represents  depends  on  the  ordering 
of  the  alphabet  that  has  been  chosen.  In  the  examples  which 
follow,  the  characters  and  order  of  the  standard  English 
alphabet  are  used.  Strings  are  mapped  on  to  a  series  of 
lines  by  giving  the  value  1  to  the  positions  which  represent 
characters  present  in  the  string  and  the  value  0  to  other 
positions. The order of the characters in the word is
preserved by starting a new line whenever it conflicts with
the alphabetic order.

Figure 6: Elastic Matching

original string  => NOVEMBER   lines: NOV | EM | BER
corrupted string => NOEMBER    lines: NO | EM | BER
conflicting character: V

The  elastic  matching  technique  works  well  in  the  case 
where  the  number  of  lines  obtained  for  both  words  is  the 
same.  A  single  exclusive  or  operation  (XOR)  gives  the  number 
of  conflicting  characters.  Szanser  has  set  a  maximum 
threshold  of  one  conflicting  character  permitted  in  a  match 
but  also  noted  that  the  method  can  be  modified  to  support 
higher  thresholds.  If  the  algorithm  is  programmed  at  the 
machine  level  with  each  position  in  a  line  being  represented 
by  a  bit,  then  one  could  expect  very  fast  execution  time. 
Furthermore,  in  CAI  applications  the  author’s  target  word 
could  be  mapped  into  the  appropriate  form  at  compile  time. 
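
The equal-line-count case can be sketched in Python with one integer bit mask per line. The function names are hypothetical and the input is assumed to contain only the letters a to z. The text does not spell out how differing bits are converted into a count of conflicting characters, so the sketch simply counts the bits; an insertion or deletion within a line flips one bit, while a substitution flips two.

```python
def to_lines(word):
    """Split a word into lines, each a 26-bit mask of letters that appear
    in strictly ascending alphabetical order."""
    lines = []
    mask = 0
    prev = -1
    for ch in word.lower():
        pos = ord(ch) - ord("a")
        if pos <= prev:
            # Alphabetic order is broken: close this line and start a new one.
            lines.append(mask)
            mask = 0
        mask |= 1 << pos
        prev = pos
    if mask:
        lines.append(mask)
    return lines


def conflicting_bits(a, b):
    """XOR corresponding lines and count the differing bits. Returns None when
    the line counts differ, a case that needs the realignment step instead."""
    la, lb = to_lines(a), to_lines(b)
    if len(la) != len(lb):
        return None
    return sum(bin(x ^ y).count("1") for x, y in zip(la, lb))


print(len(to_lines("november")))
print(conflicting_bits("november", "noember"))
```

Because the target word is known in advance, its list of masks could be computed once and stored, which is the compile-time mapping suggested above.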

Problems arise when an unequal number of lines is
obtained as in Figure 7. If the number of lines formed from
the two strings differs by more than one, then the procedure
is abandoned; otherwise an XOR is applied to the paired
lines. The first line in the student string which has
conflicting bits causes the following lines to be moved
ahead if the student string has fewer lines or moved back if
it has more lines. A new matching process then commences at
the first moved line with the previously matched characters
masked out. If it results in no more than one conflicting
bit then an approximate match is declared.

Figure 7: Errors Causing Different Line Lengths in Elastic Matching

original string => AJQ | MZ | JMQ
deletion        => AJQZ | JMQ
insertion       => AJQ | MZ | JMQ

Wagner  and  Fischer  (1974)  have  presented  the  most 
cogent  general  solution  of  the  string  similarity  problem  to 
date.  Since  it  is  based  on  an  accounting  of  the  edit 
operations  necessary  to  convert  one  string  into  another, 
their analysis can be viewed as a generalization of
Damerau's work. They allowed the three edit operations of
character deletion, insertion, and substitution, or, as
previously expressed, substitution operations on substrings
of length 0 or 1.

Each  possible  edit  operation  has  a  cost  associated  with 
it.  The  cost  of  an  edit  sequence  is  the  sum  of  the  costs 
associated  with  the  series  of  operations  comprising  the 
sequence.  They  define  "the  edit  distance  from  string  A  to 
string  B  [to]  be  the  minimum  cost  of  all  sequences  of  edit 
operations  which  transform  A  into  B". 

The  simplest  weighting  scheme  assigns  a  cost  of  1  to 
each  operation  except  the  substitution  of  a  character  with 
itself,  which  normally  has  a  cost  of  zero.  Wagner  and 
Fischer  note  that  a  better  approach  is  to  weight  operations 
inversely  to  their  probability  of  occurrence  in  minimal  cost 
edit  sequences  between  words  and  their  misspellings  or 
mistypings: 

...  cost  functions  which  depend  on  the  particular 
characters  affected  by  an  edit  operation  might  be 
useful  in  spelling  correction,  where  for  example 
because  of  the  conventional  keyboard  arrangement  it 
may be far more likely that a character "A" be mistyped
as  an  "S"  than  as  a  "Y". 

Every  edit  sequence  between  two  strings  can  be 
partially represented by a trace. In Figure 8, an example of
the diagrammatic expression of a trace, characters untouched
by lines were, depending on the direction of the trace,
either inserted or deleted. A dashed line indicates
substitution of a different character and a solid line
indicates substitution of the same character. One can think
of a trace as "a description of how an edit sequence
transforms A into B but ignoring the order in which things
happen and any redundancy in" the edit sequence.

Figure 8: A Trace Between Two Strings

Wagner  and  Fischer  proved  that  any  edit  sequence 
transforming  string  A  to  string  B  can  be  represented  by  a 
trace  such  that  the  cost  of  the  trace  is  less  than  or  equal 
to the cost of the edit sequence. Also, any trace from A to
B  represents  at  least  one  edit  sequence  having  the  same 
cost.  Therefore,  to  find  the  edit  distance,  it  is  only 
necessary  to  find  the  trace  of  least  cost.  In  the  algorithm 
which  accomplishes  this,  WAGNER  (Table  10),  idcost  is  a 
subordinate  function  which  returns  the  cost  of  inserting  or 
deleting  the  character  passed  to  it.  subcost  is  a  function 
which  returns  the  cost  of  a  substitution  operation  involving 
the  two  characters  passed  as  parameters. 

Table 10: The WAGNER Algorithm in Pseudo-Pascal

A, B : string;
i, j, m, n : integer;
D[m,n] : array of integer;

*** first build the matrix D[m,n] ***

BEGIN
    m := LENGTH(A);
    n := LENGTH(B);
    D[0,0] := 0;
    FOR i := 1 TO m DO D[i,0] := D[i-1,0] + idcost(A[i]);
    FOR j := 1 TO n DO D[0,j] := D[0,j-1] + idcost(B[j]);
    FOR i := 1 TO m DO
        FOR j := 1 TO n DO
            D[i,j] := MIN(D[i-1,j-1] + subcost(A[i],B[j]),
                          D[i-1,j] + idcost(A[i]),
                          D[i,j-1] + idcost(B[j]));

    *** now output edit distance ***

    PRINT D[m,n];
END.

Figure 9 shows the matrix generated in the calculation
of the edit distance between the word-misspelling pair of
circle vs. curcal. In the example, subcost is 0.7 and idcost
is 0.5. The minimal cost edit sequence is indicated by lines
connecting some of the matrix elements. A diagonal line
indicates substitution, a vertical line indicates deletion
of a character from the misspelling, and a horizontal line
indicates insertion of a character into the misspelling. The
resulting edit distance, found in the lower right matrix
cell, is 1.7.
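
The calculation in Table 10 can be tried out with a short Python sketch. It is an illustration only, not the thesis's Pascal program: the names wagner, flat_idcost and flat_subcost are invented here, and the flat costs (0.5 for any insertion or deletion, 0.7 for any substitution of differing characters) are simply the values quoted for the circle/curcal example.

```python
def wagner(a, b, idcost, subcost):
    """Weighted edit distance between strings a and b, following Table 10."""
    m, n = len(a), len(b)
    # d[i][j] holds the least cost of editing a[:i] into b[:j]
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + idcost(a[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + idcost(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + subcost(a[i - 1], b[j - 1]),  # substitution
                d[i - 1][j] + idcost(a[i - 1]),                 # deletion
                d[i][j - 1] + idcost(b[j - 1]),                 # insertion
            )
    return d[m][n]


def flat_idcost(ch):
    # Assumed: every insertion or deletion costs 0.5, as in the example
    return 0.5


def flat_subcost(x, y):
    # Assumed: matching characters are free, any other substitution costs 0.7
    return 0.0 if x == y else 0.7


distance = wagner("circle", "curcal", flat_idcost, flat_subcost)
print(round(distance, 2))
```

Passing the cost functions as parameters keeps the changeable weighting separate from the matrix calculation, so a table of character-specific costs could be substituted without touching wagner itself.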

Figure 9: An Example Showing the Calculation of Edit Distance

Lowrance and Wagner (1975) extended the calculation of
edit distance to include transposition operations between
any two characters. Hall and Dowling noted that the
transposition of only adjacent characters is a special case
of Lowrance and Wagner's extension and can be accounted for
by simply adding to the minimization the expression:
D[i-2,j-2] + COST(A[i-1],B[j]) + COST(A[i],B[j-1]). But
since this would equate the cost of a transposed
substitution to that of a simple substitution, it seems to
the present author that either a constant should be added
which represents the general cost of transposed
substitution, or a table must be consulted to obtain the
specific cost of the operation given the two characters in
question.
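
The first of the two remedies suggested above, adding a constant to the transposition term, might be sketched in Python as follows. The function name, the constant transcost and its value of 0.6 are assumptions made for illustration, as is the word pair used to exercise it; none of them come from Hall and Dowling or from the experiments reported here.

```python
def edit_distance_with_transposition(a, b, idcost=0.5, subcost=0.7,
                                     transcost=0.6):
    """Weighted edit distance allowing adjacent-character transposition.

    transcost is a hypothetical constant added to every transposition.
    """
    def sub(x, y):
        return 0.0 if x == y else subcost

    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + idcost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + idcost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(
                d[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                d[i - 1][j] + idcost,
                d[i][j - 1] + idcost,
            )
            if i >= 2 and j >= 2:
                # D[i-2,j-2] + COST(A[i-1],B[j]) + COST(A[i],B[j-1]),
                # written with 0-based indices, plus the added constant
                swapped = (d[i - 2][j - 2]
                           + sub(a[i - 2], b[j - 1])
                           + sub(a[i - 1], b[j - 2])
                           + transcost)
                best = min(best, swapped)
            d[i][j] = best
    return d[m][n]


print(round(edit_distance_with_transposition("recieve", "receive"), 2))
```

With the constant set below the cost of a deletion plus an insertion, a simple swap of neighbouring letters is charged as one operation; setting it very high effectively disables the extra term.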

The  edit  distance  algorithm  is  appealing  for  several 
reasons.  It  is  a  well  rationalized  and  intuitively 
reasonable  formulation.  Although  clearly  not  as  fast  as  some 
other  methods,  it  has  satisfactory  computational  bounds 
proportional  to  mn.  Since  it  relies  on  a  changeable  data 
structure to assign weights to various edit operations, it
is flexible enough to support CAI applications in different
natural  languages,  content  areas,  and  so  on.  Perhaps  its 
most  important  advantage  in  CAI  is  that,  after  the  matrix  is 
built,  the  trace  of  the  edit  distance  can  be  obtained  and 
used  to  show  the  student  how  to  correct  his  response. 
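
How the trace might be read back out of the finished matrix can be sketched in Python. This is only one possible way of doing it, under the same flat costs assumed in the Figure 9 example; the function name edit_trace and the wording of the correction messages are invented for illustration.

```python
def edit_trace(response, target, idcost=0.5, subcost=0.7):
    """Return (distance, corrections) for turning response into target."""
    m, n = len(response), len(target)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + idcost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + idcost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            step = 0.0 if response[i - 1] == target[j - 1] else subcost
            d[i][j] = min(d[i - 1][j - 1] + step,
                          d[i - 1][j] + idcost,
                          d[i][j - 1] + idcost)

    # Walk back from the lower right cell, at each step choosing a
    # neighbouring cell that accounts for the current cell's cost.
    eps = 1e-9
    corrections = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = 0.0 if response[i - 1] == target[j - 1] else subcost
            if abs(d[i][j] - (d[i - 1][j - 1] + step)) < eps:
                if step:
                    corrections.append(
                        f"replace '{response[i - 1]}' with "
                        f"'{target[j - 1]}' at position {i}")
                i, j = i - 1, j - 1
                continue
        if i > 0 and abs(d[i][j] - (d[i - 1][j] + idcost)) < eps:
            corrections.append(f"delete '{response[i - 1]}' at position {i}")
            i -= 1
            continue
        corrections.append(f"insert '{target[j - 1]}' after position {i}")
        j -= 1
    corrections.reverse()
    return d[m][n], corrections


distance, steps = edit_trace("curcal", "circle")
for step in steps:
    print(step)
```

Positions are counted from 1 in the student's response. Where several traces share the least cost, this sketch reports only one of them.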

However,  a  major  drawback  to  the  edit  distance 
algorithm  as  it  currently  stands  is  its  inability  to  operate 
on  parsed  lexemes  with  lengths  greater  than  one  —  a  fact 
which  obstructs  the  recognition  of  many  phonetically  based 
errors.  In  addition,  characters  are  not  differentially 
weighted  according  to  position. 

Hybrid  Approximate  String  Matching 

A  few  reports  exist  of  a  hybrid  approach  to  approximate 
string  matching.  In  hybrid  methods  canonical  forms  of  the 
original  strings  serve  as  operands  to  a  similarity  function. 
For  example,  Jackson  (1967)  described  a  system  used  in  stock 
brokerage  which  reduced  company  names  serving  as  index  terms 
to  a  canonical  form  by  vowel  deletion  and  other  means 
specific  to  the  application,  then  performed  a 
similarity-type  matching  operation  between  the  index  terms 
and  abbreviated  company  names  input  by  the  user. 

The most outstanding contribution to the application of
approximate string matching to CAI has been that of the
Computer-based Education Research Laboratory (CERL) at the
University of Illinois, the originators of PLATO. The
principles outlined in a CERL report (Tenczar and Golden,
1972) form the basis for the spelling algorithms used in
PLATO at Illinois, CDC's commercial version of PLATO, and
the TICCIT system now under the proprietorship of Hazeltine.
Unlike most of the other methods summarized in this chapter,
the PLATO word recognition algorithm was designed
specifically to operate within the practical constraints of
a time-shared CAI system in vivo.

Figure 10: PLATO Word Recognition

                        mapping     content words
pneumonia (author)    --------->    00101101000011001100
newwonia  (student)   --------->    00100001010011001100
                                    --------------------  exclusive OR
conflict word                       00001100010000000000

In  the  PLATO  method,  both  the  author  and  student  words 
are  mapped  to  canonical  forms  called  content  words.  Author 
words  can  be  converted  to  content  words  before  run  time  to 
enable  sizable  gains  in  execution  speed.  Content  words  have 
a  fixed  bit  length  which  depends  on  the  particular  mapping 
algorithm  used. 

Figure 11: PLATO Letter Content Field

machine  =>  1010110010010001
masheen  =>  1010101010000001

             ETAONISRHLDCUPFM
                   QVGYJKBWXZ

Figure 12: An Alternate Mapping

machine  =>  1010110011100000
masheen  =>  1010101011000000

             ETAONISRHDCUPFXG
                   ZLWMKQBVYJ

An XOR operation is performed on the two content words
to obtain a conflict word which represents matched bits by 0
and mismatched bits by 1 as in Figure 10. If the conflict
word has all zero bits then the author and student words are
judged to be an exact match. Otherwise, a binary search of a
list of content words mapped from the 500 most frequent
English words is performed to find one which equals the
student content word. If such a match is found, the student
word is judged to be different from the author word. When an
exact match cannot be found with either the author word or
any of the most frequent English words, then the word is
judged to be a misspelling of the author word if the number
of 1 bits in the conflict word does not exceed some fixed
threshold.
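
The sequence of decisions just described can be sketched in Python. Content words are represented here as plain integers, the function name judge and the returned labels are invented for illustration, and the threshold of 4 is an assumed value, since the text says only that the threshold is fixed. The two bit patterns are those of Figure 10.

```python
from bisect import bisect_left


def judge(author_cw, student_cw, frequent_cws, threshold):
    """Classify a student content word against an author content word.

    frequent_cws is a sorted list of content words for frequent English
    words; threshold is the largest number of conflict bits tolerated.
    """
    conflict = author_cw ^ student_cw          # XOR: 1 marks a mismatched bit
    if conflict == 0:
        return "exact match"
    # Binary search of the frequent-word list for the student content word
    k = bisect_left(frequent_cws, student_cw)
    if k < len(frequent_cws) and frequent_cws[k] == student_cw:
        return "different word"
    if bin(conflict).count("1") <= threshold:
        return "misspelling"
    return "no match"


author = 0b00101101000011001100    # first content word of Figure 10
student = 0b00100001010011001100   # second content word of Figure 10
print(format(author ^ student, "020b"))
print(judge(author, student, [], threshold=4))
```

Because the author content word can be computed before run time, the run-time work in this arrangement is one mapping, one XOR, one binary search and one bit count.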

The  fundamental  approach  to  the  generation  of  content 
words  proposed  by  Tenczar  and  Golden  is  the  most  significant 
and original feature of the PLATO spelling algorithm:

Past attempts to devise methods for recognizing
spelling  errors  have  used  a  minimal  set  of  human 
criteria  (e.g.,  the  phonetic  approach).  However,  no 
one  criteria  appears  sufficient  to  do  the  mysterious 
thing  which  humans  do  when  they  recognize  words. 

Rather,  we  should  use  as  many  features  of  words  as 
we  can  think  of  and  hope  that  the  interactions 
between  these  factors  will  contain  the  information 
that  human  beings  use. 

To  implement  this  concept  the  content  word  is  divided  into 
fields,  each  representing  some  characteristic  of  the 
original  string.  The  field  lengths  are  determined  in  part  by 
a  "subjective  feeling"  for  the  importance  of  the 
characteristic  which  a  field  represents.  The  fields  and 
their  lengths  suggested  by  Tenczar  and  Golden  were: 

1. Word Length (3)
2. First Character (4)
3. Letter Content (16)
4. Letter Order (10)
5. Syllabic Pronunciation (8)

They  found  that  a  41  bit  content  word  divided  into  the  above 
fields  produced  satisfactory  performance. 

It  was  recognized  that  a  standard  binary  representation 
of  word  length  is  inadequate  because  the  number  of  conflict 
bits  produced  by  an  XOR  operation  would  not  be  proportional 
to  the  difference  in  word  lengths.  With  radix  2 
representation,  a  word  length  of  1  (represented  as  001) 
would  produce  only  one  conflict  bit  when  compared  with  a 
word  of  length  5  (represented  as  101).  Instead,  a  coding 
scheme  which  attenuates  or  avoids  this  problem  should  be 
used. Tenczar and Golden put forth the Gray or unit-distance
code (Bartee, 1972) as a likely candidate.
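
The difference between the two codings of word length can be checked with a few lines of Python. The sketch uses the standard reflected Gray code, which is an assumption, since the text does not say which unit-distance code was intended; the helper names are invented.

```python
def gray(n):
    """Reflected binary Gray code of n."""
    return n ^ (n >> 1)


def conflict_bits(x, y):
    """Number of 1 bits in the XOR of two bit patterns."""
    return bin(x ^ y).count("1")


for a, b in [(1, 5), (3, 4)]:
    print(a, b,
          "binary:", conflict_bits(a, b),
          "Gray:", conflict_bits(gray(a), gray(b)))
```

A unit-distance code guarantees only that lengths differing by one differ in exactly one bit. Larger differences in length are not strictly proportional to the number of conflict bits, which is presumably why the code is described as attenuating the problem and not removing it.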

The  introduction  of  a  certain  amount  and  type  of 
ambiguity  when  mapping  to  the  content  word  is  an  important 
feature  of  the  PLATO  spelling  algorithm.  For  example,  one 
bit  in  the  First  Character  field  is  determined  by  whether 
the  character  is  a  consonant  or  a  vowel.  The  specific 
character  is  then  only  approximately  specified  by  the 
remaining  bits. 

Figure 13: PLATO Letter Order Mapping

sample word                          P   I   E   C   E
internal representation              16  09  05  03  05
modulo 10 of digraph sums              5   4   8   8

letter order field of content word   0 0 0 1 1 0 0 1 0 0
bit position                         1 2 3 4 5 6 7 8 9 10

The Letter Content field gives an approximation of
which letters are present in a word without regard to their
order and frequency. In this field insertions and deletions
are likely to cause 1 conflict bit, substitutions result in
2 conflict bits, and transpositions will produce 0
conflicts. Figure 11 illustrates one scheme suggested by
Tenczar and Golden for mapping the Letter Content field. In
the method shown, savings in space were realized by mapping
the 26 letter alphabet to a 16 bit field. This was
accomplished by doubling up the less frequent characters on
10 of the bits so as to, one may presume, "heed the
information theory dictum which states that in a coding
scheme one should strive to have a probability of 0.5 of
finding any given bit set". But the doubling of characters
on bits can be put to a better use which the authors fail to
note. The criteria for pairing characters could be
determined by the frequency of substitution between
characters in misspellings or frequency of their occurrence
together as digraphs. Figure 12 gives an alternate mapping
for letter content which may be more useful. In this
alternate mapping scheme an attempt has been made to pair
similar sounding letters together somewhat after the soundex
method.
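
The Letter Content mappings of Figures 11 and 12 can be reproduced with the following Python sketch. It assumes that the ten doubled-up letters of the second row share the last ten bits of the field. For Figure 12 that reading is borne out by the bit patterns given for machine and masheen; for Figure 11 it is an assumption by analogy, which the two sample words do not test. The function and constant names are invented.

```python
FIG11 = ("ETAONISRHLDCUPFM", "QVGYJKBWXZ")   # rows of Figure 11
FIG12 = ("ETAONISRHDCUPFXG", "ZLWMKQBVYJ")   # rows of Figure 12


def letter_content(word, mapping):
    """Return the letter content field of word as a string of bits."""
    primary, shared = mapping
    # Assumed alignment: the shared letters sit under the last bits
    offset = len(primary) - len(shared)
    position = {ch: k for k, ch in enumerate(primary)}
    position.update({ch: offset + k for k, ch in enumerate(shared)})
    bits = ["0"] * len(primary)
    for ch in word.upper():
        if ch in position:
            bits[position[ch]] = "1"   # repeated letters set the same bit
    return "".join(bits)


for mapping in (FIG11, FIG12):
    print(letter_content("machine", mapping),
          letter_content("masheen", mapping))
```

Under either mapping the two fields for machine and masheen differ in three bits. Pairing by sound changes which particular misspellings go unnoticed by this field, not the number of conflicts for this one example.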

Figure 14: Current PLATO Word Mapping

The letter order field is created by summing the
internal values (perhaps ASCII or EBCDIC representations) of
each pair of adjacent characters as in Figure 13. Where n is
the length of the field, the values resulting from a modulo
n operation on the sums are mapped into the field. Both
adjacent character transposition and character substitution
are likely to result in 2 conflict bits if they do not occur
at the end or beginning of the word, in which case they
are likely to cause 1 conflict bit. Insertions and deletions
are likely to cause 3 conflict bits if they occur internally
but only 1 conflict bit when occurring at the beginning or
ending of a word. More consistent results can be achieved by
including terminators at the beginning and end of each word
in the digraph mapping.
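
The letter order mapping of Figure 13 can be sketched in Python as follows. Letters are given the internal values A=1 through Z=26, as the figure's values for PIECE imply. Mapping a remainder of 0 to the last bit position is an assumption, since the figure shows no such case, and the function name is invented.

```python
def letter_order(word, n=10):
    """Return the letter order field of word as a string of n bits."""
    values = [ord(ch) - ord("A") + 1 for ch in word.upper()]
    bits = ["0"] * n
    # Sum each pair of adjacent characters and reduce modulo n
    for left, right in zip(values, values[1:]):
        r = (left + right) % n
        # Remainder r sets bit position r; remainder 0 is assumed to
        # set position n
        bits[(r - 1) % n] = "1"
    return "".join(bits)


print(letter_order("PIECE"))
```

Two digraphs of PIECE have the same remainder, so only three bits are set for its four digraphs. Collisions of this kind are one reason the conflict counts quoted above are described only as likely.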

The assignment of the syllabic pronunciation field
follows closely that of the letter order field. To
approximate syllabification, consonant-vowel digraphs are
identified, summed, and hashed into the field with a modulo
operation.

A  security  field  is  used  which  contains  enough 
additional  information  to  uniquely  identify  any  string.  It 
is  used  in  the  initial  comparisons  to  assure  that  two 
slightly  different  strings  cannot  map  to  the  same  content 
word  and  hence  obscure  the  recognition  of  exact  matches. 
This  field  is  masked  out  of  the  conflict  word  before  the 
conflict bits are counted.

Although the current implementation of the PLATO
spelling algorithm [6] differs in detail from the method
discussed  by  Tenczar  and  Golden,  it  is  faithful  to  the 
general  model  they  proposed.  All  CDC  computers  which  run 
PLATO  have  a  word  size  of  60  bits.  Each  content  word 
occupies  two  60  bit  machine  words  divided  into  fields  as 
shown  in  Figure  14.  The  first  bits  of  both  words  are  unused. 
In  the  first  machine  word,  3  bits  are  used  to  encode  a  count 
of  the  number  of  vowels,  4  bits  contain  a  consonant  count, 
and  the  remaining  52  bits  are  divided  into  two  26  bit  letter 
content  fields.  To  preserve  some  information  about  order, 
the  first  few  letters  are  mapped  into  the  first  letter 
content  field  and  the  next  few  into  the  second.  Any 
additional  letters  are  coded  in  the  former  field.  In  the 
second  machine  word,  1  bit  indicates  whether  the  content 
word  is  representing  a  word  or  a  number.  In  the  case  where  a 
word  is  represented,  15  bits  are  assigned  to  the  security 
field,  the  first  letter  is  coded  in  6  bits,  3  bits  encode  a 
count  of  the  consonant-vowel  pairs,  and  1  bit  indicates 
capitalization.  Since  the  final  33  bits  only  hold 
information related to synonym comparison, the total number
of bits used for PLATO's spelling algorithm is 85.

Before a misspelling is declared, the following
criteria must be met: the difference in vowel counts must
not exceed one, the difference in consonant counts must not
exceed two, the number of conflict bits in the letter
content fields must not exceed five, and the total number of
conflict bits must not exceed some threshold.

[6] All information concerning the current implementation of
PLATO has been obtained from personal communications with
William Golden, head of the PLATO Services Organization at
the University of Illinois.

The TICCIT version [7] of the PLATO spelling algorithm
uses a three field content word:

1. First Character (5)
2. Digraphs (11)
3. Letter Content (16)

A  comparison  resulting  in  fewer  than  five  conflict  bits  is 
judged  to  be  a  misspelling.  Although  the  assignment  of  all 
three  fields  is  similar  to  the  equivalent  fields  discussed 
above,  the  value  of  the  digraph  field  is  obtained  by  ORing 
together  entries  from  a  52  (26+26)  word  table  indexed  by 
character  digraph  sums. 

Summary 

CAI systems can make, and have made, use of approximate string
matching to identify incorrectly or alternately spelled
words  in  student  responses.  Designers  of  CAI  systems  can 
benefit  from  existing  literature  documenting  the  research 
and  application  of  approximate  string  matching  algorithms. 

String matching functions are of three types.
Equivalence functions divide all possible strings into a
finite number of mutually exclusive sets and return a binary
value indicating whether two strings belong to the same set.
Similarity functions are generally more useful because they
return a value indicating the proximity of two strings.
Other approximate string matching functions are not
symmetric and usually require some processing of the target
word to be done by the course author. Hybrid functions are
similarity functions where the similarity relation is taken
between two canonical forms representing equivalence
classes.

[7] All information regarding the TICCIT system was obtained
from personal communications with David Stone, Instructional
Design Supervisor with Hazeltine Corporation.


II.  Experiments  With  Approximate  String  Matching 

Originators  of  several  of  the  approximate  string 
matching  functions  described  in  Chapter  I  (Alberga,  1967; 
Blair, 1960; Damerau, 1964; Glantz, 1957; Symonds, 1970;
Tenczar & Golden, 1972) attempted to measure the accuracy of
the functions they proposed, and a few of them (Alberga,
Damerau, Symonds) conducted comparative studies which
examined differences between functions. The present chapter
reviews the methods and results of these investigations and
reports on research by this author aimed at:

1.  replicating  previously  published  findings 

2.  extending  similar  empirical  assessment  to  the  major 
functions  reviewed  in  Chapter  I 

3.  designing  and  assessing  improvements  upon  the  existing 
functions. 

Earlier  Studies 

All  previous  tests  of  approximate  string  matching 
functions  have  made  use  of  a  list  of  paired  strings.  Such 
lists  will  be  referred  to  here  by  the  term  word  data.  Word 
data  lists  can  be  formatted  as  two  columns  with  the  original 
strings,  hereafter  called  words,  in  the  left  column,  and  the 
corruptions,  misspellings,  or  alternate  spellings,  hereafter 
called  misspellings,  in  the  right  column.  The  convention 
followed  here  will  be  that  the  number  of  distinct  words  in  a 
word  data  list  will  be  represented  by  M,  and  the  number  of 
pairs or misspellings by N. The four word data lists used by
the present author, three of which were taken from studies
described  below,  appear  in  Appendix  A.  Long  word  lists  of  K 
strings,  hereafter  called  dictionaries,  were  used  by  some 
researchers  to  measure  type  II  error. 

To  test  his  abbreviation  algorithm  BLAIR,  Blair  used  a 
word  data  list  (M=N=117,  see  Appendix  A)  taken  from  a 
secretarial  reference  book  (Hutchinson,  1956).  The  4 
character  canonical  form  of  each  misspelling  was  compared 
with  the  4  character  canonical  form  of  each  word  in  the  word 
data  list.  The  case  where  a  misspelling  matched  with  several 
words  was  resolved  by  repeatedly  incrementing  the  length  of 
the  canonical  forms  and  taking  additional  passes  through  the 
list  until  either  one  or  no  match  resulted. 

In order to compare his single error matching function
DAMERAU to BLAIR, Damerau used three word data lists:
Blair's original word data, word data gleaned from newspaper
articles during proofreading (M=41, N=44; see Appendix A) [8],
and a much larger (M=N=964) word data list compiled from
errors resulting from "equipment malfunction".

Damerau's method of comparison was fundamentally
identical to that of Blair with the exception that a
dictionary composed of K=1593 words randomly selected from
text was merged with the correctly spelled words in the
larger (N=964) word data list to increase the type II error
rates to measurable levels.

[8] Two word-misspelling pairs contained in the original word
data were deleted from the word data and results reported
here: PIPE-LINE/PIPELINE because most of the functions
discussed could be extended to handle hyphens but as
proposed do not; and IZVESTIA/IZVESTIA, an identical pair
unexplained by Damerau which may have occurred as a result
of a case shift.

Table 11: Success and Error Frequencies Found by Damerau

                                   Word data
Function              Blair        Damerau I     Damerau II
                      (N=117)      (N=44)        (N=964)

BLAIR     success     89 (76%)     32 (73%)      240 (25%)
          error I     26 (22%)     12 (27%)      676 (70%)
          error II     2 ( 2%)      0 ( 0%)       48 ( 5%)

DAMERAU   success     87 (74%)     36 (82%)      812 (84%)
          error I     30 (26%)      8 (18%)      122 (13%)
          error II     0 ( 0%)      0 ( 0%)       30 ( 3%)

Blair  and  Damerau  both  expressed  their  results  with 
three  simple  statistics:  the  number  of  correct  matches,  the 
number  of  failures  to  obtain  any  match  for  a  misspelling, 
and  the  number  of  incorrect  matches.  The  last  two  of  these 
are related to type I and II errors respectively. Damerau's
results, which replicated those of Blair, are summarized in
Table 11.

The  large  difference  between  the  type  I  error 
frequencies  of  the  two  algorithms  with  the  Damerau  II  word 
data  presumably  arises  from  the  fact  that,  while  BLAIR  was 
designed  specifically  to  recognize  corruptions  introduced  by 
humans,  DAMERAU  does  not  discriminate  between  the 
corruptions of different characters. Since the interest here
is in corruptions from human sources, that is to say
students,  the  results  from  the  Damerau  II  word  data  can 
probably  be  ignored. 




Alberga  compared  65  distinct  string  matching  functions. 
Of  these:  58  were  generated  from  his  13  operations  described 
in  Chapter  I;  3  were  Faulk's  material,  ordinal,  and 
positional  similarity  functions;  1  was  a  phonetic  matching 
function of Alberga's own invention; and the remaining two
functions  were  those  of  Blair  and  Damerau. 

As  word  data,  Alberga  used  a  sample  from  a  body  of 
misspellings  collected  by  Masters  (1927).  Masters  had  grade 
8,  grade  12,  and  senior  college  students  attempt  the 
spelling  of  268  words  dictated  to  them.  The  resulting  data 
consisted  of  a  list  of  misspellings  for  each  word  with  a 
frequency  associated  with  each  misspelling.  In  his  thesis, 
Masters  did  not  present  the  complete  set  of  misspellings, 
but  he  did  provide  a  list  of  those  which  were  most  frequent. 
This list was used by the present author and is given in
Appendix  A. 

Alberga  sampled  Masters'  original  complete  data  by  1) 
expanding  it  to  a  list  of  word-misspelling  pairs  where  each 
pair  was  duplicated  according  to  the  frequency  associated 
with  the  misspelling  and  2)  selecting  1039  pairs  from  the 
expanded  list.  To  test  type  II  error,  an  additional  1039 
pairs  were  generated  by  pairing  each  of  the  selected 
misspellings with a randomly chosen word from the original
data.

The  65  matching  functions  were  applied  to  both  of  the 
lists  of  paired  strings.  For  any  given  function,  a  failure 
to return a match with a word-misspelling pair was counted
as a type I error. When a match was found with one of the
random pairings, a type II error was tallied.

Most  of  the  functions  tested  by  Alberga  returned  a 
value  representing  the  similarity  of  the  two  strings.  In 
these  cases  it  was  necessary  to  compare  the  similarity  value 
against  some  threshold  in  order  to  determine  if  the  strings 
matched.  Therefore,  a  similarity  function  together  with  a 
threshold  value  can  be  considered  to  be  an  independent 
string matching function, which here will be represented as
the name of the function followed by the threshold value in
parentheses.

Table 12 shows some of Alberga's results. PHONE was a
phonetically based equivalence function which, using rules
requiring parsing, attempted to reduce each string to a
canonical form representing its pronunciation. As simply
the sum of the binary coincidence matrix, the function SUM
is closely related to Faulk's material similarity function.
Alberga's analysis of his results was based on the
assumption that the two types of error are of equal
importance. The functions were judged according to how well
they minimized the expression

ERROR1^2 + ERROR2^2

where ERROR1 and ERROR2 are percentage measurements of the
two error types. Following this reasoning, Alberga concluded
that the ROOF-SBYC-STRING(.12) function was the most
successful and that, due to relatively large type I error
rates, BLAIR, DAMERAU, and PHONE "failed rather badly".

Table 12: Error Frequencies Found by Alberga

                                      N=1039
Functions                   Error I         Error II

BLAIR                       346 (33%)        0 ( 0%)
DAMERAU                     471 (45%)        0 ( 0%)
PHONE                       375 (36%)        0 ( 0%)
SUM(.71)                     44 ( 4%)       44 ( 4%)
ROOF-SBYC-STRING(.12)        22 ( 2%)       21 ( 2%)
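
Alberga's criterion can be applied to the frequencies of Table 12 with a short Python sketch. It assumes that the expression being minimized is the sum of the squared percentage errors. Taking the square root of that sum would give the same ordering. The names are invented for illustration.

```python
N = 1039

# (type I errors, type II errors) for each function in Table 12
TABLE_12 = {
    "BLAIR": (346, 0),
    "DAMERAU": (471, 0),
    "PHONE": (375, 0),
    "SUM(.71)": (44, 44),
    "ROOF-SBYC-STRING(.12)": (22, 21),
}


def alberga_score(err1, err2, n=N):
    """Sum of squared percentage errors; smaller is better."""
    p1 = 100.0 * err1 / n
    p2 = 100.0 * err2 / n
    return p1 ** 2 + p2 ** 2


ranking = sorted(TABLE_12, key=lambda f: alberga_score(*TABLE_12[f]))
for name in ranking:
    print(name, round(alberga_score(*TABLE_12[name]), 1))
```

Squaring penalizes a function heavily for one large error rate, which is why the three functions with no type II error at all still rank below SUM and ROOF-SBYC-STRING under this criterion.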

Symonds  tested  several  functions  using  two  sets  of  word 
data.  The  first  word  data  list,  the  results  of  a  grade  5 
spelling  test,  consisted  of  N=320  misspellings  of  M=100 
words.  The  functions  were  tested  by  comparing  every 
misspelling  with  every  word. 

Symonds  compared  her  own  function,  BLAIR,  and 
CONTEX-SBYC-PAIRS which she described as "one of Alberga's
better  methods".  In  fact,  the  latter  function  was  never 
tested  by  Alberga.  Symonds  also  failed  to  report  the 
threshold  at  which  this  function  was  tested  and  the  type  II 
error  found  for  this  function  and  BLAIR.  The  results  which 
she  did  provide  are  shown  in  Table  13. 

Table 13: Error Frequencies Found by Symonds

                                N=320
Function                Error I         Error II

BLAIR                   201 (63%)       -
CONTEX-SBYC-PAIRS       162 (51%)       -
SYMONDS                 108 (34%)       4 ( 1%)

The second word data list, whose source was not
reported, contained 108 misspellings of 566 words, of which
one may presume 458 were not paired with misspellings. Five
functions were tested with this data: the 3 functions tested
with the first word data list, DAMERAU, and DAMERAU-SYMONDS
which worked by first attempting to get a match with DAMERAU
and then applying SYMONDS if none was found. Although the
method of testing was essentially the same as for the first
word data list, the results were reported as in Table 14
with type I and II error confounded. A quantity representing
the accuracy of the function was incremented when a
misspelling matched with the appropriate word but failed to
match with any others.

Symonds concluded that the DAMERAU-SYMONDS function was
"the  best  solution  among  those  tested  for  the  problem  of 
automatic  detection  of  misspellings". 

Tenczar  and  Golden  used  a  misspeller's  dictionary  as 
word  data  to  measure  the  type  I  error  associated  with  the 
PLATO  matching  function.  Although  they  claimed  that  this 
source  provided  "common  misspellings  in  English", 
dictionaries  of  this  type  usually  provide  only  contrived 
phonetic  misspellings  of  commonly  misspelled  words. 

Type  II  error  was  measured  by  comparing  successive 
pairs  of  words  from  a  conventional  dictionary.  These  methods 
produced measurements of 5% and 14% for type I and II error
respectively.

Table 14: Success Frequencies Found by Symonds

                        N=108
Function                Accuracy

BLAIR                    83 (77%)
DAMERAU                  85 (79%)
CONTEX-SBYC-PAIRS        90 (84%)
SYMONDS                  97 (90%)
DAMERAU-SYMONDS         102 (94%)

Tenczar and Golden also used a rather novel testing
strategy in which 50,000 strings of varying lengths were
generated by a random process involving predetermined
English letter frequencies. All 50,000^2 possible comparisons
between these strings were performed. Following the
rationale that pairs matched by the algorithm should look
like misspellings of each other to human judges, these
matching pairs were printed out and inspected. Although the
number of these pairs was not given, Tenczar and Golden
reported that only 50 pairs were judged to be not
misspellings.

Difficulties  which  plague  attempts  to  interpret  the 
collective  results  of  the  above  studies  arise  primarily  from 
differences  between  the  methods  used  in  the  individual 
studies.  For  example,  the  failure  of  the  researchers,  with 
the  exception  of  Damerau,  to  use  word  data  from  other 
studies  (and  the  failure  to  report  word  data  used)  naturally 
undermines  comparison  between  studies. 

A  factor  which  further  complicates  interpretation  for 
applications in CAI is the tendency of researchers to base
their methods on a data retrieval model such as the
following:

1.  A  data  base  is  to  be  searched  by  the  use  of  key  words. 

2.  The  key  words  in  the  data  base  are  spelled  correctly. 

3. A user performing a search may enter misspelled
keywords.

But in response analysis, the usual case is that a
correctly  spelled  target  word  is  compared  with  some  number 
of  misspelled  or  extraneous  input  words.  Since  most  of  the 
approximate  matching  functions  under  consideration  are 
symmetric,  this  difference  should  not  affect  the  measurement 
of  type  I  error.  However,  the  discrepancy  between  models 
does  have  implications  for  the  measurement  of  type  II  error. 

Finally, there seemed to be general confusion
surrounding the importance of type II error relative to type
I error. In most cases, type II error was either not
reported, was confounded with type I error, or was
immeasurable due to an insufficiently large word data list.
Alberga endangered the usefulness of his conclusions by
using arbitrary comparison criteria which resulted in
thresholds unsuitable for most applications. This was in
spite of his recognition, as evidenced by the following
statement, that an optimal balance of type I and type II
error is highly application dependent.

It  should  be  noted  that  a  solution  satisfactory  in 
one  area  may  not  be  satisfactory  in  another.  For 
example,  in  the  airline  systems,  searches  must  be 
made  for  the  records  of  passengers  whose  names  have 
been  misspelled  on  entry.  In  this  case,  one  is 
willing  to  retrieve  a  number  of  wrong  names  as  long 
as  the  right  one  is  among  them.  Thus  the  threshold 
may  be  set  fairly  low.  In  computer-assisted 
instruction,  on  the  other  hand,  one  may  want  to  be 
very  certain  of  the  match  before  one  tells  the 
student  that  he  is  probably  right  but  seems  to  have 
misspelled  his  answer,  rather  than  telling  him  he  is 
wrong.  In  this  application,  therefore,  the  threshold 
may  be  set  rather  high. 

This  fact  complicates  the  problem  of  comparing  algorithms  -- 
especially  considering  that  even  for  specific  applications 
the optimal balance of errors is usually not known.


Method 

A  program  of  research  was  undertaken,  the  goal  of  which 
was  to  evaluate  approximate  string  matching  functions  so  as 
to  determine  those  which  would  be  most  useful  in  a  CAI 
environment.  An  attempt  was  made  to  model  the  method  after 
the  following  application: 

1. A number of target words have been entered into a CAI
system  by  various  authors. 

2.  Over  time,  the  matching  function  is  used  on  many 
occasions  to  compare  each  target  word  to  several  strings 
input  by  students. 

3.  Some  of  these  strings  are  corruptions  of  the  target 
word,  but  most  are  extraneous  words. 

Following  this  model,  words  in  the  word  data  correspond 
to  target  words,  misspellings  in  the  word  data  correspond  to 
student  corruptions  of  the  targets,  and  the  members  of  the 
dictionary  correspond  to  extraneous  words  input  by  the 
students.  Type  I  error  was  measured  as  the  proportion  of 
pairs  from  a  word  data  list  which  the  function  failed  to 
match.  Type  II  error  was  measured  as  the  proportion  of 
comparisons  between  the  words  in  the  word  data  list  and 
entries  in  a  large  dictionary  which  the  function  did  match. 
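
The two error measures defined above can be sketched in a few lines. This is only an illustration, not the Pascal driver used in the study: the function name `error_rates`, the `matches` predicate, and the toy word lists are assumptions introduced here. Identical target-dictionary pairs are not counted as errors, and the type II denominator is taken to be the number of targets times the number of dictionary entries.

```python
def error_rates(pairs, dictionary, matches):
    """Return (type I proportion, type II proportion).

    pairs      -- list of (target, misspelling) tuples from a word data list
    dictionary -- list of extraneous words
    matches    -- hypothetical predicate matches(target, candidate) -> bool
    """
    # Type I: a genuine misspelling that the function fails to match.
    type1 = sum(1 for target, wrong in pairs if not matches(target, wrong))

    # Type II: an extraneous dictionary word that the function does match.
    targets = sorted({target for target, _ in pairs})
    type2 = 0
    for target in targets:
        for word in dictionary:
            if word == target:
                continue  # identical pairs are never counted as errors
            if matches(target, word):
                type2 += 1
    comparisons = len(targets) * len(dictionary)
    return type1 / len(pairs), type2 / comparisons


def same_first_letter(target, candidate):
    """Toy stand-in for a real approximate matching function."""
    return target[:1] == candidate[:1]


# Made-up data, for illustration only.
pairs = [("receive", "recieve"), ("friend", "freind"), ("friend", "phrend")]
dictionary = ["river", "table", "fried", "house"]
type1, type2 = error_rates(pairs, dictionary, same_first_letter)
```

With this toy predicate one of the three misspellings is missed, and two of the eight target-dictionary comparisons produce a false match.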

All  programs  were  written  in  Pascal  and  compiled  to 
native  code  to  run  on  a  Digital  Equipment  Corporation  VAX 
11/780  minicomputer  under  the  VAX/VMS  operating  system.  The 
string matching functions were written as procedures in
separately  compiled  modules  and  were  called  by  various 
driver  programs  which  managed  all  input,  output  and  error 
tabulation.  Due  to  the  large  number  of  string  comparisons 
performed,  the  project  consumed  over  700  hours  of  cpu  time. 

It  was  recognized  that  the  type  of  words  and 
misspellings  encountered  by  a  string  matching  function  in  a 
CAI  environment  varies  widely  depending  on  the  students, 
content,  and  instructional  method.  For  example,  one  expects 
different  kinds  of  corruptions  or  variations  of  the  target 
word  to  arise  from  an  anatomy  course  involving  Latin  (or 
latinized)  terms  in  comparison  with  an  English  course  for 
deaf  students.  Therefore  it  was  decided  to  use  word  data 
from  several  different  sources  to  partially  guard  against 
the  possibility  of  general  conclusions  being  drawn  from 
unrepresentative  word  data. 

The  four  word  data  lists  used  are  identified  in  Table 
15  and  are  presented  in  their  entireties  in  Appendix  A.  The 
Blair,  Damerau,  and  Masters  word  data  lists  are  the  same  as 
those described earlier. The author collected the Nesbit
word data by requesting Edmonton public school teachers
(grades 2-6) to submit their students' work in spelling
tests.

The  upper  entry  in  each  cell  of  Table  15  is  either  M  or 
N  depending  on  the  column.  For  example,  the  Nesbit  word  data 
list  contains  524  misspellings  of  213  words.  The  second  line 
in each cell contains the mean and, in parentheses, the
standard  deviation  of  the  string  length. 

The  type  of  misspellings  produced  by  an  individual  with 
a  history  of  auditory  experience  of  the  original  words,  is 
likely  to  differ  from  the  type  produced  had  the  history  been 
dominated  by  visual  experience.  Similarly,  misspellings 
occurring  during  dictation  are  likely  to  differ  from  those 
occurring  when  the  word  is  copied  from  a  visual  model  (for 
example,  when  a  handwritten  manuscript  is  typed). 
Misspellings  in  the  Masters  and  Nesbit  word  data  lists  were 
the  result  of  dictation.  The  status  of  the  remaining  two 
word  data  lists  is  not  clear. 

The  American  Heritage  Word  Frequency  Book  (Carroll, 
Davies,  and  Richman,  1971)  was  used  as  a  source  of 
extraneous  words.  It  presents  a  list  of  86,741  words  drawn 
from  1,045  500-word  samples  of  published  text  intended  for 
children  in  grades  3-9.  For  this  purpose  a  word  was  defined 
as  any  string  of  characters  bounded  by  blanks.  As  a  result, 
the  list  contains  proper  names,  abbreviations,  published 
misspellings  and  various  oddities  not  found  in  conventional 
dictionaries.  A  standard  frequency  index  (SFI)  related  to 
the  estimated  frequency  of  occurrence  in  the  lexicon  was 
calculated  for  each  word. 

Table 15: Word Data Lists

            Words (M)     Misspellings (N)    Source

Blair       117           117                 common
            8.8 (2.0)     8.7 (1.9)           misspellings

Damerau     41            44                  newspaper
            8.2 (1.9)     7.8 (1.8)           errors

Masters     179           320                 grades
            9.1 (2.3)     8.9 (2.4)           8,12,16

Nesbit      213           524                 grades 2-6
            6.1 (1.6)     6.2 (1.6)

The subset (K=21718) of this list used in the present
study was determined by taking words with an SFI of 36.0 or
greater, removing those containing non-alphabetic characters
(except those containing apostrophes, which in these cases
were deleted from the word), converting all characters to
lower case, and finally removing any duplications resulting
from the case shift.
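
The dictionary preparation steps just described can be sketched as follows. The function name `build_dictionary`, the (word, SFI) input layout, and the sample entries with their SFI values are assumptions made for illustration; they are not taken from the Word Frequency Book or from the original programs.

```python
def build_dictionary(entries, min_sfi=36.0):
    """entries: iterable of (word, sfi) pairs; returns a sorted word list."""
    kept = set()  # a set discards duplicates created by the case shift
    for word, sfi in entries:
        if sfi < min_sfi:
            continue  # too rare
        word = word.replace("'", "")  # apostrophes are deleted, word is kept
        if not word.isalpha():
            continue  # any other non-alphabetic character rejects the word
        kept.add(word.lower())
    return sorted(kept)


# Made-up sample entries, for illustration only.
sample = [("The", 88.0), ("the", 85.1), ("don't", 60.2),
          ("3rd", 50.0), ("zyx", 20.0), ("co-op", 40.0)]
words = build_dictionary(sample)
```

Here "The" and "the" collapse to one entry, "don't" survives as "dont", and the remaining three entries are rejected.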

During  the  measurement  of  type  II  error,  dictionary 
words  were  tested  against  the  target  to  determine  if  they 
were identical, in which case the target-dictionary pair
was  not  passed  to  the  approximate  matching  function  and  an 
error  was  not  counted.  Unfortunately,  there  was  no  easy  way 
of  preventing  words  in  the  dictionary  which  might  have  been 
accepted  by  an  author  as  misspellings,  from  being  matched  by 
the  algorithm  and  counted  as  errors.  To  avoid  this  problem, 
it  was  decided  to  execute  pilot  runs  with  three  algorithms 
and  all  four  word  data  files.  The  algorithms  (SOUNDEX, 
PLANIT,  and  DAMERAU)  were  chosen  on  the  basis  of  a  priori 
assumptions  about  their  type  II  error  rates  (more  liberal 
algorithms  were  desirable),  their  mutual  distinctiveness, 
and  speed  of  execution.  The  thousands  of  word  pairs  matched 
by  the  algorithms  and  counted  as  type  II  errors  during  these 
12  pilot  runs  were  printed  out  and  inspected  for  pairs  so 
close in pronunciation or meaning that an author would
accept the dictionary entry as equivalent to the target. To
be more specific, the constituents of a pair were deemed
equivalent if:

1. an orthographic interpretation was possible which
produced pronunciations which were identical or
differed by only minor nuance or inflection (e.g.
size, sighs; does, dose, doze; massage, message)

2. their meaning was close due to a common root
(accomodate, accomodation; field, afield) or they were
synonymous.

As  may  be  seen  in  Appendix  A,  all  target  words  possess 
a  /  delimiter  distinguishing  an  initial  root  from  endings 
indicating  grammatical  function  and  so  on.  During  the 
measurement  of  type  II  error,  pairs  were  only  passed  to  the 
approximate  matching  function  if  their  roots  did  not  match 
exactly.  It  was  found  that  most  of  the  problematic  matches 
uncovered  by  inspection  could  be  prevented  by  moving  the 
delimiter  to  the  left  so  that  the  pair  shared  the  indicated 
root.  The  only  way  of  resolving  cases  where  the  pair  shared 
no  initial  root  or  where  the  common  root  was  so  short  that 
many  legitimate  erroneous  matches  would  have  been  blocked  by 
shifting  the  delimiter,  was  to  remove  the  dictionary  word 
from  the  dictionary  file.  The  deleted  words  are  listed  in 
Appendix  B. 

It  was  decided  that  the  two  initial  goals,  namely  the 
replication  of  previous  studies  and  the  testing  of  untried 
algorithms, could best be served by a single experiment
comparing  members  of  a  selected  set  of  algorithms  by  testing 
them  with  all  four  word  data  lists.  The  results  of  this 
experiment  were  then  used  to  determine  the  nature  of  a 
second  experiment  aimed  at  assessing  improved  algorithms. 


Experiment  I 


The nine algorithms [...] selected from all the [...]
following criteria [...]:

1. feasibility of [...] algorithm unf[...] the information
[...] allow for an [...]

2. inclusion in [...] number of algorithms [...] exclusion
of [...] successful.

3. apparent [...] pote[...] point testing (e.g. Glantz, [...]

The  algorithms  selected  are  listed  in  Table  16  in 
association  with  the  mean  execution  time  recorded  for  each 
word  data  list.  Of  course  these  times  are  not  the  shortest 
possible.  They  could  presumably  be  minimized  by  coding  in 
assembler  or  possibly  by  devising  a  more  efficient  high 
level implementation. The Pascal source code for each
algorithm is shown in Appendix C.

Table 16: Mean Execution Times in Milliseconds

                                  Word data
Functions            Blair     Damerau    Masters    Nesbit

BLAIR                 2.77       2.50       3.17       2.00
DAMERAU               0.14       0.14       0.11       0.14
SYMONDS               4.62       4.05       4.38       3.57
DAMERAU-SYMONDS       4.50       3.99       4.35       3.34
SOUNDEX               1.59       1.76       1.59       1.48
PLANIT                1.92       2.03       2.40       1.69
ROOF-SBYC-STRING     11.51      10.61      12.01       7.78
WAGNER               11.13      10.27      10.87      10.46
HALL                 18.21      17.03      18.95      16.04

Generally  the  versions  of  the  algorithms  used  were  the 
same  as  those  proposed  and  tested  by  previous  authors.  Both 
BLAIR  and  SOUNDEX  generated  canonical  forms  truncated  to 
four  characters.  The  PLANIT  canonical  form  was  not 
truncated.  The  similarity  metrics  generated  by 
ROOF-SBYC-STRING, WAGNER, and HALL were normalized to range
as  integer  values  from  0  (for  dissimilar  strings)  to  100 
(for  identical  strings).  A  vector  of  101  elements 
representing  the  thresholds  on  the  similarity  value  was 
updated  after  every  comparison  such  that  all  elements  having 
indices  greater  than  (in  the  case  of  type  II  error)  or  less 
than  (in  the  case  of  type  I  error)  the  similarity  value  were 
incremented.  After  all  the  comparisons  had  been  performed, 
the  vector  contained  the  error  frequencies  associated  with 
each  threshold. 

For  both  WAGNER  and  HALL,  the  cost  of  insertion  or 
deletion  (idcost)  was  set  at  0.5  and  the  cost  of 
substitution  (subcost)  was  set  at  0.7.  Intuitively,  a  value 
of subcost such that:

idcost  <  subcost  <  2(idcost) 

seemed  appropriate.  Although  exceeding  the  upper  limit  would 
obviously  result  in  no  substitutions  being  performed,  the 
effect  of  going  below  the  idcost  was  not  clear.  Some  pilot 
experimentation  indicated  that  varying  the  subcost  between 
these  bounds  had  little  or  no  effect. 

HALL  was  an  extension  of  WAGNER  proposed  by  Hall  and 
Dowling  which  allowed  for  the  transposition  of  adjacent 
characters.  Following  the  present  author's  suggestion  in 
Chapter  I,  an  additional  transposition  cost  (tcost)  was 
introduced  so  that  the  complete  cost  of  such  a  transposition 
would  be: 

subcost(i,j-1) + subcost(i-1,j) + tcost

where  tcost  was  arbitrarily  set  at  0.2. 
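
A minimal sketch of an edit distance with these costs is given below. It is not the Pascal routine of Appendix C; the function name `hall_distance` and the exact arrangement of the recurrence are assumptions. In particular, the transposition step is applied in the restricted form in which the two swapped characters are not edited again, and no normalization to the 0-100 similarity scale is attempted.

```python
def hall_distance(a, b, idcost=0.5, subcost=0.7, tcost=0.2):
    """Weighted edit distance allowing adjacent transposition (sketch)."""

    def sub(x, y):
        # Substituting a character with itself is free.
        return 0.0 if x == y else subcost

    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * idcost
    for j in range(1, n + 1):
        d[0][j] = j * idcost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(
                d[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # substitute/match
                d[i - 1][j] + idcost,                       # delete from a
                d[i][j - 1] + idcost,                       # insert into a
            )
            if i > 1 and j > 1:
                # subcost(i,j-1) + subcost(i-1,j) + tcost, in 0-based indices
                swap = (d[i - 2][j - 2]
                        + sub(a[i - 1], b[j - 2])
                        + sub(a[i - 2], b[j - 1])
                        + tcost)
                best = min(best, swap)
            d[i][j] = best
    return d[m][n]
```

For an exact swap of two adjacent characters both substitution terms are zero, so the transposition costs only tcost, which is cheaper than one substitution or one insertion plus one deletion.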

Figures 15, 16, 17, and 18 show the results for each
word data list. Type I error (on the vertical axis) is
represented as a percentage calculated by:

100(error1/comparisons)

where  comparisons  =  N.  Type  II  error  (on  the  horizontal 
axis)  is  represented  as  the  mean  frequency  of  errors  per 
target  word  calculated  by: 

error2/(comparisons/K) 

where  comparisons  =  MK.  Note  therefore  that  readings  on  the 
horizontal  axis  differ  from  percent  of  type  II  error  by  a 
constant  factor  of  100/K. 


[Figures 15-18: Type I error (%) on the vertical axis plotted
against Type II error (mean errors/target) on the horizontal
axis, one figure for each word data list.]

The binary functions (BLAIR, DAMERAU, SYMONDS,
DAMERAU-SYMONDS, SOUNDEX, PLANIT) appear on the graphs as
single points. The non-binary functions (ROOF-SBYC-STRING,
WAGNER,  HALL)  appear  as  connected  points,  each  representing 
a  distinct  threshold  setting  of  the  function.  In  Appendix  D, 
which  shows  the  complete  results  of  experiments  I  and  III, 
can  be  found  the  threshold  settings  associated  with  each 
point,  as  well  as  the  results  for  extreme  settings  not 
appearing  on  the  graphs. 

A  system  of  ranking  the  algorithms  had  been  settled  on 
a  priori  which  was  independent  of  the  relative  importance 
which  may  be  placed  on  the  two  error  types.  According  to 
this  system,  an  algorithm  could  only  be  ranked  higher  than  a 
second  algorithm  if  it  bettered  that  algorithm  in  both  error 
types  over  all  word  data  lists.  This  was  applied  to 
ROOF-SBYC-STRING,  WAGNER,  and  HALL  as  if  each  of  the 
thresholds  tested  with  these  functions  identified  a  single 
function.  In  Figure  19,  a  diagram  illustrating  the  results 
of this ranking, connecting arrows indicate a better => worse
relationship.  Unconnected  algorithms  share  the  same  rank. 
This  diagram  shows  comparisons  between  the  binary  functions 
and  between  them  and  the  non-binary  functions,  but  does  not 
cover  the  numerous  possible  comparisons  of  non-binary 
algorithms  among  themselves. 
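
The ranking rule amounts to a strict dominance test applied across every word data list. A small sketch follows; the function name `betters` and the error figures are invented for illustration and are not results from the study.

```python
def betters(results_a, results_b):
    """True if A has lower type I and type II error than B on every list.

    Each argument maps a word data list name to a (type1, type2) tuple.
    """
    return all(
        results_a[name][0] < results_b[name][0]
        and results_a[name][1] < results_b[name][1]
        for name in results_b
    )


# Made-up error figures, for illustration only.
alg_a = {"Blair": (10.0, 1.0), "Nesbit": (20.0, 2.0)}
alg_b = {"Blair": (12.0, 1.5), "Nesbit": (25.0, 2.5)}
alg_c = {"Blair": (8.0, 3.0), "Nesbit": (30.0, 1.0)}

a_over_b = betters(alg_a, alg_b)  # A wins everywhere
a_over_c = betters(alg_a, alg_c)  # mixed outcome, no ranking
c_over_a = betters(alg_c, alg_a)  # mixed outcome, no ranking
```

When neither of two algorithms betters the other, as with A and C here, they share the same rank under this system.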

The  diagram  shows  that  no  one  instance  of  WAGNER  or 
HALL  is  better  than  DAMERAU,  SYMONDS,  or  DAMERAU-SYMONDS 
over all word data lists. This is in spite of the fact,
illustrated by the 4 previous figures, that there are
instances of HALL which, on all word data lists taken
singly, are better than both SYMONDS and DAMERAU-SYMONDS.
The anomaly is caused by a shifting of WAGNER and HALL data
points between word data lists, particularly between the
Nesbit word data and the others.

[Figure 19: Experiment I Comparisons Over All Word Data.
Diagram of ranking relationships among HALL(80-85), HALL(68),
HALL(66-67), HALL(65), WAGNER(79-81), WAGNER(64), BLAIR,
DAMERAU, SYMONDS, DAMERAU-SYMONDS, SOUNDEX, PLANIT, and
ROOF-SBYC-STRING(41-48); the connecting arrows are not
recoverable.]

Referring  back  to  the  previous  figures,  note  that  one 
of  the  instances  of  HALL  has  been  marked  with  its  threshold 
value  --  75.  Observe  further  that  this  point  remains  within 
the  rectangles  defined  by  SYMONDS,  DAMERAU-SYMONDS,  and  the 
origin  in  all  cases  except  with  the  Nesbit  word  data  list. 
Although  it  can  not  be  determined  with  certainty  which 
characteristics  of  the  Nesbit  word  data  are  responsible  for 
this  interaction,  perhaps  the  simplest  hypothesis  is  that 
shorter  string  length  produced  a  greater  increase  in  type  I 
error in HALL than in the binary algorithms.

Experiment  II 

The  results  of  experiment  I  indicated  that  HALL  had  the 
greatest  potential  as  a  starting  point  from  which  a  very 
accurate  approximate  matching  function  might  be  produced. 
This  was  especially  apparent  considering  that,  while  it  is 
easily  modified  by  altering  the  costs  of  its  edit 
operations,  it  had  as  yet  not  been  specifically  tuned  to 
account  for  orthographically  based  misspellings  which  are  so 
common.  It  was  decided  that  HALL  would  be  modified  and 
tested  in  a  stepwise  fashion  using  the  Nesbit  word  data.  The 
modified versions, identified as HN0 through HN7, are
plotted in Figure 20. Again, each point represents the type
I and type II error resulting from a single threshold
setting.

HN0 was identical to HALL with the following
exceptions.  A  value  of  0.5  was  added  to  idcost  if  the 
character  in  question  was  in  the  initial  position.  A  value 
of  0.1  was  added  to  subcost  if  one  of  the  characters 
involved  was  in  the  initial  position  and  0.2  was  added  if 
both  were  in  that  position.  These  values  were  arrived  at  by 
inspecting  a  small  sample  of  dictionary  words  that  HALL 
erroneously  matched  by  editing  the  initial  character,  and 
calculating  the  additional  weights  required  to  lower  their 
similarity values below the threshold. When tested, HN0 was
found to better the performance of HALL over all thresholds.

[Figure 20: Experiment II Results]

HN1 was identical to HN0, except that all identical
adjacent  characters  in  both  strings  were  deleted  before  the 
edit  distance  matrix  was  constructed.  Since  this 
modification produced no improvement over HN0, another
version, HN2, was created which instead deleted identical
adjacent occurrences of only the letters l, m, n, p, r, t.
The  failure  of  HN2  to  produce  improved  accuracy  resulted  in 
the  abandonment  of  this  approach. 

HN3 was similar to HN0, but idcost was modified as
follows:

character       idcost

e               0.3
a,i,o,u,h       0.4
otherwise       0.5

Although  the  values  were  selected  rather  arbitrarily,  the 
rank  order  was  based  on  statistics  gathered  by  Masters  on 
the  frequency  of  insertion  and  deletion  of  different  letters 
in  misspellings. 

Modest  improvements  produced  by  HN3  led  to  the 
development  of  HN4  which  was  identical  to  HN3  except  that 
subcost  was  modified  as  follows: 


characters      subcost

b,f,p,v         0.6
d,t             0.6
c,k,q           0.5
c,s             0.4
otherwise       0.7

With the exception that, of course, the cost of substituting
a character with itself was always 0.0, the above costs
should be interpreted such that the subcosts in the right
column  apply  to  substitution  between  characters  within  the 
group  indicated  on  the  left.  HN4  produced  the  best  results 
of  all  previous  attempts. 

HN7  was  identical  to  HN4  except  that  a  lower  cost  (0.4) 
was  introduced  for  substitutions  between  vowels.  This 
resulted in marked gains over the performance of HN4.
Appendix  E  lists  actual  type  I  and  type  II  errors  committed 
by  HN7  with  the  Nesbit  word  data. 
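
Taken together, the cost adjustments accumulated from HN0 through HN7 can be written as two small cost functions. This is a sketch under stated assumptions, not the code of the study: the names `hn7_idcost` and `hn7_subcost` are hypothetical, the vowels are assumed to be a, e, i, o, u, the initial-position penalty is assumed not to apply when a character is substituted with itself, and where a pair falls in more than one group the lowest cost is used.

```python
VOWELS = set("aeiou")  # assumed vowel set

# (group of characters, cost of substituting within the group)
SUB_GROUPS = [
    (set("bfpv"), 0.6),
    (set("dt"), 0.6),
    (set("ckq"), 0.5),
    (set("cs"), 0.4),
    (VOWELS, 0.4),  # the HN7 addition
]


def hn7_idcost(ch, initial=False):
    """Insertion/deletion cost of ch; higher in the initial position."""
    if ch == "e":
        cost = 0.3
    elif ch in set("aiouh"):
        cost = 0.4
    else:
        cost = 0.5
    return cost + (0.5 if initial else 0.0)


def hn7_subcost(x, y, x_initial=False, y_initial=False):
    """Cost of substituting x with y."""
    if x == y:
        return 0.0
    within = [c for group, c in SUB_GROUPS if x in group and y in group]
    cost = min(within, default=0.7)
    # 0.1 if one character is in the initial position, 0.2 if both are.
    return cost + 0.1 * (int(x_initial) + int(y_initial))
```

Functions of this kind would replace the constant idcost and subcost inside the edit distance recurrence, with the position flags supplied by the matrix indices.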

Experiment  III 

HN7  was  tested  with  the  remaining  word  data  lists  to 
determine  if  it,  being  more  accurate  than  HALL,  would  have 
any single threshold settings which bettered DAMERAU,
SYMONDS, or DAMERAU-SYMONDS over all word data lists. The
threshold  settings  found  to  satisfy  this  condition  were 
78-80,  which  bettered  SYMONDS,  and  75-77  which  bettered 
DAMERAU-SYMONDS.  No  instances  of  HN7  were  found  to  better 
DAMERAU  over  all  word  data  lists. 
III. Prospects for Approximate String Matching

This  concluding  chapter  briefly  examines  how 
approximate  string  matching  functions  can  be  chosen  for,  and 
put  to  work  in,  CAI  systems.  Factors  to  be  considered  in  the 
selection  of  these  functions  are  identified.  There  is  a 
short  discussion  of  response  markup  and  dictionary  support 
services.  These  facilities  make  use  of  approximate  matching 
but  go  beyond  the  fundamental  response  analysis  problem 
which  was  the  concern  of  the  previous  chapters.  Finally, 
areas  which  would  benefit  from  the  attention  of  further 
research  are  identified. 

Selection  Factors 

Of  course,  there  exists  no  completely  determined  and 
objective  rationale  for  deciding  which  functions  are  most 
appropriate  and  how  they  are  best  implemented.  Although  a 
publicly  accepted  set  of  factors  upon  which  to  base  these 
decisions  can  probably  be  defined,  individuals  will  disagree 
as  to  the  weight  carried  by  each  factor  in  the  selection 
process. What follows is an admittedly non-orthogonal list
of  six  selection  factors  accompanied  by  comments  on  the 
implications  and  relative  importance  of  each  factor. 


Accuracy 

As  the  subject  of  the  last  chapter,  accuracy  is  the 
selection  factor  about  which  we  have  the  most 
information.  Conclusions  based  on  the  superior 
performance  of  the  edit  distance  algorithm  are  weakened 
by the fact that several of the functions reviewed
in Chapter I, including the PLATO algorithm, were not
tested.  Furthermore,  it  must  be  recognized  that  the 
findings  of  Chapter  II  are  dependent  on  the  word  data 
used.  There  may  indeed  exist  some  application  for  which 
BLAIR  is  more  accurate  than  HALL. 

Speed 

Not  unexpectedly,  the  execution  speed  of  an 
algorithm  was  found  to  be  inversely  related  to  its 
accuracy.  DAMERAU  turned  out  to  be  very  fast,  a  fact 
supporting  its  selection  for  use  in  operating  systems, 
compilers,  and  the  like.  Although  the  most  accurate 
algorithm,  HN7,  was  also  the  slowest,  there  are  a  number 
of  techniques  which  could  be  applied  to  shorten  the 
execution  time  of  all  versions  of  the  edit  distance 
algorithm. 

1.  Threshold  Pruning. 

In CAI, we are not usually interested in the edit
distance  per  se,  but  only  whether  it  exceeds  a  given 
threshold.  In  this  case  the  threshold  can  be  used, 
while  the  matrix  is  being  built,  to  avoid  processing 
elements  which  are  predicted  to  exceed  the 
threshold.  The  amount  of  time  saved  by  threshold 
pruning  will,  therefore,  be  dependent  on  the 
threshold  value  used. 

When  the  algorithm  can  determine  that  D[m,n] 
must  exceed  the  threshold,  then  it  is  able  to 
immediately return a no-match result to its caller.
Adhering  to  this  principle,  the  simplest  threshold 
pruning  modification  is  to  check  each  completed 
column  (where  the  matrix  is  built  column  by  column) 
to  ensure  that  it  contains  at  least  one  element  not 
exceeding  the  threshold.  If  so,  it  moves  on  to  the 
next  column.  But  if  not,  matrix  processing  ceases 
and  the  match  fails.  Preliminary  experimentation 
with  the  word  data  used  in  Chapter  II,  indicates 
that  this  modification  can  yield  a  time  saving  of 
about  50%  for  comparisons  resulting  in  failure  to 
match.

To  achieve  time  savings  for  successful 
comparisons,  it  is  necessary  to  employ  a  technique 
which  prunes  diagonally.  It  is  worth  noting  that 
such  techniques  will  yield  lesser  savings  when 
applied  to  algorithms  allowing  adjacent 
transposition  than  when  applied  to  those  which  allow 
only  insertion,  deletion,  and  substitution. 

2.  Length  Difference  Test. 

Since  very  few  misspellings  differ  in  length  from 
the  original  by  more  than  2  characters,  a  test  of 
length  difference  can  be  applied  before  entering  the 
edit  distance  algorithm  to  exclude  pairs  which  are 
unlikely  to  match.  Of  course,  savings  will  only  be 
realized  for  those  pairs  which  are  excluded  by  the 
test.  Preliminary  investigations  excluding  pairs 
whose length differed by more than two characters,
yielded  mean  savings  of  about  40%  over  all 
comparisons  which  resulted  in  no  match.  No  loss  of 
accuracy  was  detected. 

3.  Content  Difference  Test. 

Following  the  same  principle  as  the  length 
difference  test,  content  fields,  similar  to  those 
used  in  the  PLATO  algorithm,  could  be  exclusively 
ORed  together  to  find  the  number  of  characters  found 
in  only  one  of  the  two  strings.  Pairs  differing  by 
more  than  some  fixed  number  of  characters  could  be 
rejected  without  further  processing.  Although  this 
test  is  likely  to  be  more  successful  at  weeding  out 
dissimilar  pairs  than  the  length  difference  test, 
whether  its  superior  discriminatory  powers  justify 
the  time  expended  in  the  generation  of  content 
fields  is  an  empirical  question. 

4.  Truncation. 

By  truncating  both  strings  to  some  maximum  length, 
an  upper  bound  can  be  imposed  on  the  execution  time. 
In  practice,  since  execution  time  increases  as  the 
product  of  the  string  lengths,  truncation  will 
probably  be  necessary  to  prevent  abuse.  Preliminary 
investigations  which  truncated  both  strings  to  six 
characters,  found  savings  of  about  30%  unfortunately 
accompanied  by  a  considerable  loss  of  accuracy.  It 
now  seems  likely  that  truncation  to  lengths  of  8-10 
characters would be more appropriate.

5.  Combining  Time  Saving  Techniques 

It  is  probable  that  a  combination  of  the  above 
techniques  would  result  in  the  best  performance. 
Also,  the  speed  of  DAMERAU  makes  it  a  candidate  for 
use  as  a  kind  of  pretest.  The  following  example  in 
pseudo-Pascal  shows  what  a  time  optimized  design 
might  look  like.  EDITDIST  is  a  version  of  the  edit 
distance  algorithm  which  truncates  both  strings  to 
10  characters  and  uses  threshold  pruning  while 
building  the  matrix.  Both  EDITDIST  and  DAMERAU  are 
functions  which  return  a  result  of  either  MATCH  or 
NOMATCH.  RETURN  transfers  control  back  to  the 
calling  program  with  a  returned  value  of  either 
MATCH  or  NOMATCH. 

BEGIN
  { absolute difference of the two string lengths }
  diff := |LENGTH(stringa) - LENGTH(stringb)|;

  CASE diff OF
    > 2: RETURN(NOMATCH);                     { length difference test }
      2: RETURN(EDITDIST(stringa, stringb));
    < 2: IF DAMERAU(stringa, stringb) = MATCH { fast pretest }
         THEN RETURN(MATCH)
         ELSE RETURN(EDITDIST(stringa, stringb));
  END;
END.
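
The techniques listed above can also be sketched in runnable form. The following is an illustration under several assumptions, not the implementation behind the timings reported here: all function names are hypothetical, the threshold is expressed as a maximum edit cost (`max_cost`) rather than a normalized similarity, the pruned routine allows only insertion, deletion, and substitution, the matrix is built row by row rather than column by column, and `damerau_match` is a simplified single-error test standing in for DAMERAU.

```python
def letter_mask(word):
    """Content field: one bit for each letter a-z present in the word."""
    mask = 0
    for ch in word.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask


def content_difference(a, b):
    """Number of distinct letters found in only one of the two strings."""
    return bin(letter_mask(a) ^ letter_mask(b)).count("1")


def within_threshold(a, b, max_cost, idcost=0.5, subcost=0.7, max_len=10):
    """Truncated edit distance test with threshold pruning."""
    a, b = a[:max_len], b[:max_len]
    prev = [j * idcost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * idcost]
        for j in range(1, len(b) + 1):
            step = 0.0 if a[i - 1] == b[j - 1] else subcost
            cur.append(min(prev[j - 1] + step,
                           prev[j] + idcost,
                           cur[j - 1] + idcost))
        if min(cur) > max_cost:
            return False  # every later cell must also exceed the threshold
        prev = cur
    return prev[-1] <= max_cost


def damerau_match(a, b):
    """True if a and b are equal or differ by one single-character error."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diffs) == 1:
            return True  # one substitution
        return (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and a[diffs[0]] == b[diffs[1]]
                and a[diffs[1]] == b[diffs[0]])  # adjacent transposition
    short, longer = (a, b) if len(a) < len(b) else (b, a)
    # one insertion or deletion
    return any(longer[:i] + longer[i + 1:] == short
               for i in range(len(longer)))


def approx_match(a, b, max_cost=1.0):
    """Combined design: length test, fast pretest, then pruned distance."""
    diff = abs(len(a) - len(b))
    if diff > 2:
        return False
    if diff < 2 and damerau_match(a, b):
        return True
    return within_threshold(a, b, max_cost)
```

The early exit in `within_threshold` is safe because every cell of a row is derived from the previous row or from earlier cells of the same row by adding a non-negative cost, so the row minimum never decreases. `content_difference` is shown separately, since the pseudo-Pascal design above does not use the content difference test.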

Program  Size 

Of the functions reviewed in Chapter I, the PLATO
algorithm with its 500 word dictionary was probably the
largest. Of those tested, SYMONDS and DAMERAU-SYMONDS
were  the  largest.  However,  none  of  these  can  be 
considered  prohibitively  large,  and  size  can  probably  be 
disregarded as a discriminatory factor.

Ease  of  Implementation  and  Maintenance 

Programmers  are  expensive.  Algorithms  which  require 
a  machine  level  implementation,  such  as  the  PLATO 
algorithm,  are  likely  to  be  harder  and  more  costly  to 
install,  maintain,  and  port  to  other  systems.  The 
readability  of  the  implemented  program,  whether  it  be  in 
assembler  or  a  high  level  language,  is  also  important. 
Bugs  are  not  readily  apparent  in  programs  whose  source 
code  is,  perhaps  of  necessity,  complicated  and  unclear. 
This  factor  has  special  importance  in  approximate  string 
matching,  since  bugs  which  hamper  the  accuracy  of  the 
algorithm  may  go  unnoticed  by  users. 

Adaptability 

As  has  been  observed  earlier,  different  CAI 
applications  have  different  response  analysis 
requirements.  Approximate  string  matching  algorithms 
which  can  be  adapted  to  specific  applications  by  minor 
modification  or  by  the  passing  of  parameters,  have  a 
clear  advantage  over  those  which  cannot.  The  edit 
distance  algorithm  is  by  far  the  best  in  this  regard. 
Adjustments  in  the  relative  probabilities  of  type  I  and 
type  II  error  can  be  made  by  varying  the  threshold  value 
passed as a parameter. Adaptations to diverse natural
languages can be achieved by modification of the edit
[text missing in source]

Response Markup

In  PLATO,  response  markup  is  a  facility  which  informs 
the  student  about  discrepancies  between  his  response  and  the 
author's  target  by  writing  special  symbols  on  the  screen 
below  the  entered  response.  PLATO  response  markup  indicates 
unanticipated  words,  words  not  in  correct  order,  and  the 
locations  at  which  words  are  missing.  The  only  information 
available  about  the  internal  form  of  a  word  is  whether  it  is 
misspelled  and  whether  the  initial  letter  should  be 
capitalized.

The  edit  distance  algorithm  can  provide  more  complete 
support  for  response  markup.  By  tracing  the  edit  sequence 
back  through  the  matrix,  enough  information  can  be  obtained 
to  show  a  student  exactly  how  his  misspelling  can  be 
corrected.  If  small  substitution  costs  are  introduced  for 
case  shifts,  instances  of  inappropriate  letter  case  can  be 
indicated  as  well. 
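
Tracing the edit sequence back through the matrix can be sketched as below. This is a simplified illustration with unit costs and only insertion, deletion, and substitution; the function name `edit_script` and the tuple layout of its result are assumptions, and the transposition and case-shift operations are left out.

```python
def edit_script(response, target):
    """List the edits that turn `response` into `target`.

    Each edit is (operation, position in response, old char, new char).
    """
    m, n = len(response), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = response[i - 1] == target[j - 1]
            d[i][j] = min(d[i - 1][j - 1] + (0 if same else 1),
                          d[i - 1][j] + 1,
                          d[i][j - 1] + 1)

    # Walk back from the bottom-right corner, recording each edit.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and response[i - 1] == target[j - 1]
                and d[i][j] == d[i - 1][j - 1]):
            i, j = i - 1, j - 1  # characters agree, nothing to mark
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            ops.append(("substitute", i - 1, response[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", i - 1, response[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", i, "", target[j - 1]))
            j -= 1
    ops.reverse()
    return ops
```

Each recorded position tells a markup routine where to place a symbol beneath the response; for the response "paskal" and the target "pascal" the script contains a single substitution at the fourth character.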

Figure  21  shows  some  proposed  symbols  for  response 
markup  and  gives  some  examples  of  how  they  would  be  used. 
Correctly  spelled  words  matching  the  target  would  be 
underlined  to  distinguish  them  from  unanticipated  words, 
which  would  be  left  untouched.  Only  anticipated  correct 
responses  would  be  subject  to  markup.  It  would  seem 
pedagogically  unsound  to  provide  markup  for  anticipated 
incorrect  responses. 

[Figure 21: Response Markup. Symbols for shift up, shift
down, insertion, deletion, substitution, transposition, and
ok, with examples for the target "Pascal" (PAskal, pacsall,
Pascle). The symbol glyphs are not recoverable.]

An alternative method of response markup is to process
each word in the response after it has been entered but
before the student signals the completion of the response by
pressing the enter key. This way, the student's spelling
errors are brought to his attention, and perhaps corrected
by him, immediately after they are committed.

Depending  on  the  author's  goals,  one  of  the  following 
strategies  might  be  used  after  a  misspelled  word  has  been 
detected  and  marked: 

1.  The  student  is  free  to  either  go  back  and  correct  the 
word,  or  to  ignore  it  and  continue  entering  his 
response.  When  the  enter  key  is  pressed,  all  outstanding 
misspellings  are  automatically  corrected  on  the  screen. 

2.  The  student  is  forced  to  go  back  and  correct  the 
misspelling  before  continuing.  The  cursor  is  moved 
sequentially  to  each  point  in  the  misspelling  at  which  a 
correction is to be made. Whereas the strategy in point
one may incline the student to continue misspelling the
same  word  in  later  responses,  this  method,  much  like  a 
human  tutor,  imposes  a  cost  on  spelling  errors  which  the 
student  will  try  to  avoid. 

3.  The  student  must  press  some  key  which  causes  the  word  to 
be  corrected  automatically  and  allows  him  to  continue.  A 
compromise  between  the  above  methods,  this  strategy 
imposes  a  small  cost  (one  key  press)  on  errors. 

According to a recent advertisement in the popular
microcomputing press9, Tenczar and others have implemented
an  extension  of  BASIC  for  the  Apple  II  which  provides 
support  for  textual  response  analysis  including  a  response 
markup  facility  virtually  identical  to  the  one  proposed 
above.  This  implies  that  they  have  used  the  edit  distance 
algorithm,  or  something  very  similar. 


Dictionary  Support 

Several  features  can  be  ident 
desirable  additions  to  CAI  systems 
searching,  sometimes  for  an  approx 
relatively  lengthy  lists  of  words 
dictionaries.  This  section  consist 
of  these  features  followed  by  a  te 
expressing  them  as  operations  on  a 
structures.

9 EnBASIC by the Computer Teaching Corporation, advertised in
MICRO, April 1983.

The grammar and spelling correction capabilities of
modern  word  processors  would  be  extremely  useful  during  the 
creation  of  instructional  frames  of  text,  one  of  the  most 
time  consuming  tasks  of  course  authoring.  The  spelling 
correction  facilities  of  word  processors  search  for  every 
word  parsed  from  the  document  in  a  large  (10,000-100,000 
word)  dictionary  representing  the  English  lexicon.  If  a  word 
cannot  be  found,  true  spelling  correctors  present  to  the 
author  the  closest  approximation  found  in  the  dictionary, 
which  is  presumed  to  be  the  correct  spelling  of  the 
unidentifiable  word.  The  author  then  has  the  choice  of: 

1.  replacing  the  word  by  its  approximation 

2.  entering  the  word  into  the  dictionary 

3.  ignoring  the  discrepancy  and  continuing 

4.  deleting  the  word  and  trying  to  correct  it  himself. 

Another desirable feature requiring dictionary
searching is an online source of standard dictionary
definitions, etymologies, synonyms, antonyms, and so on.
Both students and authors should be able to use this sort of
facility. When the word supplied by the user cannot be found
in the dictionary, the closest approximations found should
be presented as alternatives from which a choice can be
made.

Returning to the realm of response analysis, a synonym
matching facility, similar to that supported by the -vocabs-
and -concept- commands in TUTOR, will be an important part
of future CAI systems10. In using such a feature, the author
specifies,  perhaps  by  default,  synonyms  associated  with 
target  words.  Student  responses  exactly  or  approximately 
matching  the  synonym  would  be  treated  as  if  they  matched  the 
original  target. 

Finally,  recall  that  in  the  PLATO  approximate  string 
matching  algorithm,  a  response  word  not  matching  exactly 
with  a  target  word  was  first  searched  for  in  a  dictionary 
containing  the  500  most  common  English  words.  If  it  matched 
exactly  with  any  words  in  this  dictionary,  it  was  rejected 
as  a  potential  approximate  match.  This  scheme  is  probably  an 
effective  way  of  reducing  type  II  error  and  can  easily  be 
applied  to  any  approximate  matching  algorithm. 

10 Among others, the WISE authoring system from WICAT and
EnBASIC both support synonym matching.

11 This is a manifestation of a phenomenon known to
statistical linguists as Zipf's Law (Zipf, 1949).

How can we design a system which integrates all of
these features as operations upon a common set of dictionary
structures? In Computer Programs for Spelling Correction,
which is concerned with the design of an interactive word
processor type spelling corrector, Peterson (1980) provides
us with at least half the solution. He noted that, according
to the Brown corpus analyzed by Kucera and Francis (1967),
the 256 most frequent words in the English lexicon account
for over 55% of specific instances of usage11. Peterson also
observed that for any specific document, there exists a
relatively small group of words which occur frequently in
the document but infrequently, or perhaps not at all,
elsewhere. This led him to propose the following searching
strategy:

First, search the small table of most common English
words.

Next, search the table of words which have already
been used in this document.

Finally, search the large list of remaining words in
the main dictionary. If a word is found at this
level, add it to the table of words for this
document.
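
The three-level strategy above can be sketched as follows. This is a minimal Python illustration, not Peterson's implementation: in-memory sets stand in for the three tables (the main dictionary would in practice reside on disk), and the names `make_lookup`, `lookup`, and `document_words` are assumptions.

```python
def make_lookup(common_words, main_dictionary):
    """Build a lookup function that searches three tables in order."""
    common = frozenset(common_words)      # small table of common words
    main = frozenset(main_dictionary)     # stand-in for the large dictionary
    document_words = set()                # grows as the document is checked

    def lookup(word):
        if word in common:
            return "common"
        if word in document_words:
            return "document"
        if word in main:
            # Found only at the slowest level: remember it for this document.
            document_words.add(word)
            return "main"
        return None                       # not found: a candidate misspelling

    return lookup, document_words
```

A word found in the main dictionary is reported as `"main"` the first time and as `"document"` on every later occurrence, which is the saving the strategy is intended to produce.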


Adapting  Peterson's  proposal  to  the  problem  at  hand, 
one  might  consider  using  the  following  four  data  structures: 

o  Common  word  list  (about  256  words) 

-  contains  the  most  frequent  English  words 

-  structured  for  fast  exact  match  searching 

-  resident  in  main  memory,  or  in  a  localized 
area  of  virtual  memory 


o  Course  word  list  (less  than  2000  words) 

-  one  for  each  course 

-  contains  all  words  entered  to  text  or  glossary 

-  structured  for  exact  and  approximate  match 
searching 

-  resident  in  main  or  virtual  memory 

o  Master  Dictionary  (20,000  -  100,000  words) 

-  contains  many  English  words  (including  those 
in  the  common  words  list)  with  associated 
definitions,  synonyms,  etc. 

-  structured  for  exact  and  approximate  match 
searching 

-  controlled  by  the  system  manager 

-  resident  on  disk 

o  Glossary  (less  than  2000  words) 

-  one  for  each  course 

-  contains  words  chosen  and  defined  by  the 
course  author,  with  synonyms,  etc. 

-  structure  similar  to  master  dictionary 

-  resident  on  disk 


As in Peterson's strategy for spelling correction, text
words entered by the author are searched for first in the
common words list, then in the course words list, and
finally in the master dictionary. If this exact match search
fails, a search is performed on the course words list and
the master dictionary to find the closest approximation. If
the author decides to keep the unidentified word, it is
inserted into the course words list. He may decide that it
should also go into the glossary, in which case he must
supply a definition, synonyms, and so on.

When  an  author  is  specifying  a  target  for  which  synonym 
matching  is  to  be  allowed,  synonyms  drawn  from  the  master 
dictionary  and  glossary  are  displayed.  Appropriate  synonyms 
are  then  selected  by  the  author  and  copied  into  some 
separate  structure  used  for  response  analysis. 

The  master  dictionary  and  glossary  would  be  used  by 
authors  and  students  in  much  the  same  way  as  one  would  use 
their  paper  counterparts.  Searching  with  an  approximate  key 
would  be  possible.  The  author  would  be  able  to  alter  the 
contents  of  the  glossary  at  any  time. 

In  response  analysis,  when  a  response  word  failed  to 
match  the  target  exactly,  the  common  word  list  would  be 
searched  for  an  exact  match  before  an  approximate  match 
would  be  attempted.  As  described  earlier,  if  an  exact  match 
was  found  an  approximate  match  with  the  target  would  not  be 
permitted. 
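
The order of tests in response analysis can be sketched as below. This is an illustrative Python sketch under stated assumptions: a plain unit-cost edit distance stands in for the function developed in this thesis, the distance is normalized by the longer string length, and the acceptance `threshold` is an arbitrary hypothetical value.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def judge(response, target, common_words, threshold=0.34):
    """Classify a response word as 'exact', 'approximate', or 'no match'."""
    if response == target:
        return "exact"
    if response in common_words:
        # A correctly spelled common word is not treated as a misspelling.
        return "no match"
    relative = levenshtein(response, target) / max(len(response), len(target))
    return "approximate" if relative <= threshold else "no match"
```

With the target `then`, the response `thon` is accepted as an approximate match, while `than` is rejected because it is itself a common word.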

The success of designs of the type described in this
section rests on the efficiency of the search algorithms
used. Algorithms which efficiently search for only an exact
match are commonplace, and may be found in texts such as
Knuth (1973). There is no reason to believe that the
available algorithms are not up to the task. However,
sequential searches for approximate matches in structures
as large as the proposed master dictionary are certain to be
too slow for use in truly interactive systems. This
assessment is based on the assumption that the maximum time
users can be expected to wait happily for a search is
somewhat less than the time it would take them to do it
manually with a paperback dictionary.

Fortunately,  a  few  investigators  have  proposed  methods 
which  avoid  a  sequential  search  for  an  approximate  match  by 
imposing  some  special  structure  on  the  dictionary.  Mor  and 
Fraenkel  (1982)  described  a  hash  table  implementation  of 
Damerau's method which appears to be very fast. Kashyap and
Oommen (1981) described an efficient version of the edit
distance  algorithm  which  represents  the  dictionary  as  a  tree 
structure  such  that  any  information  obtained  during  the 
evaluation  of  any  one  edit  distance,  which  may  be  relevant 
to  the  computation  of  other  edit  distances,  is  not  wasted. 
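
The idea of sharing work between edit distance computations can be illustrated with a dictionary stored as a tree of characters. The Python sketch below is not the algorithm of Kashyap and Oommen; it is a common textbook technique, assuming unit edit costs, in which each tree node computes one row of the distance matrix from its parent's row, so that words sharing a prefix share the rows for that prefix. The names `build_trie` and `search`, and the `"$"` end-of-word marker, are assumptions.

```python
def build_trie(words):
    """Store words in a nested-dict tree; '$' marks the end of a word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w
    return root


def search(trie, query, max_dist):
    """Return sorted (distance, word) pairs within max_dist of query."""
    results = []
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row):
        # One matrix row per tree node, derived from the parent's row.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            row.append(min(row[j - 1] + 1,
                           prev_row[j] + 1,
                           prev_row[j - 1] + (query[j - 1] != ch)))
        if "$" in node and row[-1] <= max_dist:
            results.append((row[-1], node["$"]))
        if min(row) <= max_dist:          # otherwise no descendant can qualify
            for nxt, child in node.items():
                if nxt != "$":
                    walk(child, nxt, row)

    for ch, child in trie.items():
        if ch != "$":
            walk(child, ch, first_row)
    return sorted(results)
```

Whole subtrees are abandoned as soon as every entry in a row exceeds `max_dist`, which is what avoids a sequential pass over the dictionary.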

Further  Research 

Increases in the accuracy of the edit distance
algorithm can, no doubt, be realized by adjusting edit costs
to more appropriate values than those used in HN7. Some
systematic procedure for achieving this would be preferable
to the rather arbitrary method practised by the present
author. One interesting possibility would be to start with
the  edit  costs  used  in  HN7  and  introduce  an  adaptive 
mechanism  such  that,  while  calculating  the  edit  distance 
between  a  large  number  of  word-misspelling  pairs,  the  costs 
of  operations  appearing  in  minimal  cost  edit  sequences  are 
decremented  relative  to  the  costs  of  other  edit  operations. 
If  this  were  counterbalanced  by  an  incrementing  of  costs  of 
operations  occurring  in  minimal  cost  edit  sequences  between 
randomly  selected  word-word  pairs,  the  costs  might  converge 
to  optimal  values. 
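
One pass of such an adaptive mechanism might look like the following Python sketch. It is a simplified, hypothetical illustration: costs are kept per operation type rather than per character as in HN7, the step size and lower bound (`step`, `floor`) are arbitrary, and nothing here shows that the costs would in fact converge.

```python
def min_cost_ops(source, target, costs):
    """Non-matching operations on one minimal-cost edit sequence."""
    m, n = len(source), len(target)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0], back[i][0] = d[i - 1][0] + costs["delete"], "delete"
    for j in range(1, n + 1):
        d[0][j], back[0][j] = d[0][j - 1] + costs["insert"], "insert"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            choices = [
                (d[i - 1][j - 1] + (0.0 if same else costs["substitute"]),
                 "match" if same else "substitute"),
                (d[i - 1][j] + costs["delete"], "delete"),
                (d[i][j - 1] + costs["insert"], "insert"),
            ]
            d[i][j], back[i][j] = min(choices, key=lambda c: c[0])
    # Follow the stored back-pointers from the bottom-right corner.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        op = back[i][j]
        if op != "match":
            ops.append(op)
        if op in ("match", "substitute"):
            i, j = i - 1, j - 1
        elif op == "delete":
            i -= 1
        else:
            j -= 1
    return ops


def adapt(costs, misspelling_pairs, word_pairs, step=0.01, floor=0.1):
    """Return a copy of costs after one adaptive pass over both samples."""
    costs = dict(costs)
    for word, misspelling in misspelling_pairs:
        for op in min_cost_ops(misspelling, word, costs):
            costs[op] = max(floor, costs[op] - step)   # cheaper: seen in real errors
    for a, b in word_pairs:
        for op in min_cost_ops(a, b, costs):
            costs[op] += step                          # dearer: seen between words
    return costs
```

Repeating `adapt` over a large sample of word-misspelling pairs and randomly selected word-word pairs would be the experiment proposed above.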

Another area of potential improvement is the
normalization factor which converts the absolute edit
distance into a value expressed relative to string length.
It may be that by modifying this factor to improve the
robustness of the algorithm to variations in string length,
the shifting of data points reported in Chapter II would be
reduced.

Several  other  modifications  to  the  edit  distance 
algorithm  can  be  imagined  which  might  increase  its  accuracy. 
One could first parse each string into phonetically
meaningful lexemes (e.g. th, ph, ght, c, x), and use these
instead  of  characters  as  the  subjects  of  the  edit 
operations.  Another  possibility  is  to  incorporate  more 
positional  information  such  that,  for  example,  e  would  have 
a  lower  deletion  cost  were  it  in  the  final  position. 
Unfortunately,  these  sorts  of  modifications  could  not  be 
made  without  considerable  increases  in  execution  time. 
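
The lexeme idea mentioned above can be sketched as follows. This Python illustration assumes a small hypothetical lexeme list (`LEXEMES`), a greedy longest-match parse, and unit edit costs; none of these choices comes from the experiments reported here.

```python
LEXEMES = ("ght", "th", "ph")     # hypothetical multi-letter units


def parse(word, lexemes=LEXEMES):
    """Split word into units, preferring the longest lexeme at each position."""
    ordered = sorted(lexemes, key=len, reverse=True)
    units, i = [], 0
    while i < len(word):
        for lex in ordered:
            if word.startswith(lex, i):
                units.append(lex)
                i += len(lex)
                break
        else:
            units.append(word[i])   # no lexeme here: single character
            i += 1
    return units


def unit_distance(a, b):
    """Unit-cost edit distance between two sequences of units."""
    prev = list(range(len(b) + 1))
    for i, ua in enumerate(a, 1):
        cur = [i]
        for j, ub in enumerate(b, 1):
            cur.append(min(prev[j] + 1,
                           cur[j - 1] + 1,
                           prev[j - 1] + (ua != ub)))
        prev = cur
    return prev[-1]
```

Under these assumptions `fone` is one substitution away from `phone` when the strings are parsed into units, but two operations away when compared character by character. The extra parsing step is one source of the increased execution time noted above.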
Improvements which would have the greatest practical
importance would be those which reduced execution time
without sacrificing accuracy. Experimentation with the time
saving techniques described earlier would contribute to this
objective.

As  CAI  becomes  widespread,  realistic  word  data  upon 
which  approximate  string  matching  functions  can  be  tested 
will  become  readily  available.  In  addition  to  supporting  the 
further  improvement  and  tuning  of  these  functions,  this  will 
clarify  our  understanding  of  the  degree  to  which  specialized 
functions  are  necessary  for  different  content  areas  and 
student  types. 

The  next  major  advancement  in  the  solution  of  the 
approximate  string  matching  problem  will  probably  be  the 
introduction  of  contextual  information.  This  will  mark  the 
beginning  of  an  inevitable  merger  of  this  problem  with  the 
greater  problem  of  natural  language  understanding. 
Appendix A. Word Data Lists


Blair Word Data

absor/bent    absorbant
absor/ption    absorbtion
accommodat/e    accomodate
acquiesc/e    aquiese
analy/ze    analize
antarctic/    antartic
asinin/e    assinine
assist/ance    assistence
auxiliar/y    auxillary
banana/    bananna
bankrupt/cy    bankrupcy
brethren/    bretheren
brit/ain    britian
buoy/ancy    bouyancy
categor/y    catagorey
chauffeur/    chauffuer
chimney/s    chimnies
coliseum/    colosium
colossal/    collosal
commit/ment    committment
committee/    commitee
conced/e    consede
conscien/tious    conscientous
consensus/    concensus
controver/sy    controvercy
corrugat/ed    corrigated
cynic/al    synical
deuce/    duece
develop/    devellope
digni/tary    dignatary
disappoint    disapoint
drastic/ally    drasticly
ecsta/sy    ecstacy
embarrass/    embarass
exaggerat/e    exagerate
exist/ence    existance
exten/sion    extention
february/    febuary
fi/ery    firey
filipin/os    philipinoes
flammable/    flamable
forthright/    fortright
forty/    fourty
fulfill/    fullfil
gnaw/ing    knawing
govern/ment    goverment
gramma/r    grammer
heartrending/    heartrendering
hemorrhag/e    hemorrage
hindrance/    hinderence
hygien/e    hygeine
idiosyncra/sy    idiocyncracy
incens/e    insense
incidental/ly    incidently
infallib/le    infalable
inoculat/e    innoculate
insist/ence    insistance
interced/e    intersede
interfer/ed    interferred
jeopard/ize    jeprodise
kimono/    kimona
licens/e    lisence
liquef/y    liquify
maint/enance    maintainance
manag/ement    managment
maneuver/    manuveur
mortgag/ed    mortgauged
nickel/    nickle
ninetynin/th    nintynineth
nowadays/    nowdays
occasion/ally    ocassionaly
occurr/ence    occurence
pamphlet/    phamplet
permissi/ble    permissable
persever/ance    perseverence
persua/de    pursuade
philippine/s    phillipines
Pittsburgh/    pittsburg
plagiaris/m    plaigarism
playwright/    playwrite
prairie/    prarie
preced/ing    preceeding
precipice/    presipice
prefer/able    preferrable
presumptuous/    presumptous
privilege/    privelege
propel/ler    propellor
psycholog/ical    psycological
public/ly    publically
pursue/r    persuer
questionnaire/    questionaire
recipient/    resipient
relevan/t    revelent
renown/    renoun
repel/    repell
rhapsody/    raphsody
rhododendron/    rhododrendon
rhubarb/    ruhbarb
rhythm/    rythm
sacrileg/ious    sacreligious
safe/ty    safty
scissors/    sissers
se/ize    sieze
separat/e    seperate
shepherd/    sheperd
similar/    similiar
sincer/ity    sincerety
souvenir/    souviner
specimen/    speciment
sui/ng    sueing
surreptitious/    sureptitous
transfer/able    transferrable
unparallel/ed    unparalelled
us/age    useage
vegetable/    vegatable
Wednesday/    wedensday
weird/    wierd

Damerau Word Data

abdel/    abdul
aggressi/on    agression
algiers/    algeirs
anniversary/    aniversary
antarctic/    antartic
barabashev/    barbashev, barbashov
chiang/    chaing
Colombo/    columbo
communis/t    cmmunist
congressm/an    conressman
consumer/    cosumer
dalai/    dalal
foreign/    foriegn
frontier/    fronier
ghosh/    gosh
grotewohl/    grotewahl
guerilla/    guerrila
independen/ce    independance
jodrell/    jodreu
khinzemens/    khimzamene
khrushchev/    khrushcev, krushchev, khrushev
kozlov/    koslov
kuibyshev/    kuibishev
longju/    longnu
mohammed/    mohamed
negotiat/ion    negociation
philippin/es    phillipines
phongsaly/    phonsaly
Pittsburgh/    pittsburg
plebiscite/    plebescite
prague/    prage
research/    rearch
satellite/    sattelite
sirimavo/    sirimauo
southeast/    souteast
suslov/    suscov
thankhek/    thankek
ulbright/    ulbrigt
vecerni/    vercerni
vientianne/    vientanne
visit/    vist

Masters Word Data

accommodat/e    accomodate
accommodat/ion    accomodation, accomidation
accompan/ying    accompaning
accrue/d    accrude, acrude, acrued
accustom/ed    accustom, acustomed
acknowledg/ment    acknowledgement
acquaint/ance    acquaintence, acquantance
adjourn/ed    adjourn, ajourned
alumni/    alumnae
amiable/    aimiable, aimeable
anniversary/    aniversary, anniversery
annoyance/    anoyance, annoiance
anticipat/e    antisipate, antisapate, antispate
anticipat/ing    antisipating, anticapating
anticipat/ion    antisipation, antisapation
apolog/y    appology
apparent/    apparant, aparent
apparent/ly    apparantly, apperently
appetite/    appitite, appatite, apetite
appropriat/e    appropiate, apropriate
approximat/ely    approximatly, aproximately
arrangement/s    arrangments, arangements
ascertain/    assertain, asertain, accertain, acertain
attach/ing    attatching, attacting
attorney/s    attornies
bankrupt/cy    bankrupcy, bankrupsy
beneficia/l    benefical, benificial
bore/d    board, bord
bungalow/    bungalo
buried/    hurried, harried, varied
cancel/ed    cancel
cancel/lation    cancelation
canvass/    canvas
catalog/ues    catalogue, catlogue, catologues
cease/    seize, sease
chemist/ry    chemestry
collateral/    calateral
commit/ted    commited, committee
committee/s    commities, committes
communicat/e    comunicate, community
comparative/ly    comparitively
competent/    compitent, competant
complet/ing    completeing
conceiv/e    concieve, conseve
concept/ion    conseption
confirm/ation    conformation, confermation
conscienc/e    conscious, concience
conscious/    concious, consious
consisten/t    consistant
continu/ous    continous, continious
controvers/y    controversey, controvercy
convenient/ly    conviently
correspond/ence    correspondance, correspondents
coun/sel    council
curiosity/    curiousity

curious/    courious, qurious
damage/d    damage
debit/    debt, debet
decided/ly    dicidely
deem/    deam
definite/ly    definately, definatly
delegat/es    deligates
delinquent/    delinquint, deliguent
deny/    denie, denigh
despair/    dispair, dispare
determin/ed    determine, detirmined
dining/    dinning
disappoint/    disapoint, dissapoint, dessapoint, dissappoint
disappoint/ment    dissapointment, dissappointment
discretion/    discression, disgression
divine/    devine
dormitory/    dormatory, dormotory
drop/ped    dropt, droped
duly/    duely, dully
dying/    dieing
edition/    addition
efficien/cy    effeciency
elementary/    elimentary, elementry
eliminat/e    illiminate, elliminate
employ/ees    employies, employes
enem/ies    enimies, enemys
entrance/    enterance, enterence
equip/ped    equiped, equipt
ere/    err
esteem/ed    esteem, estimed
exist/ence    existance
exist/s    exist, exsist
exquisite/    exqusite, exquiset, exquisit
fascinat/ing    facinating, fasinating
folly/    fally, folley
fundamental/    fundemental, fundimental
galvanize/d    galvanize, galvinized
genius/    genious, geneous
gorge/ous    georgeous, gorgas
grateful/    greatful
grippe/    grip
guarantee/d    guarantee, gauranteed
guardian/    gaurdian, guardien, gardian
imitat/ion    immitation, imatation
immense/ly    immensly, imensely, immencely
incidentally/    incidently
inconvenien/ce    inconvience
inconvenien/ced    inconvienced, inconvenience
indefinite/    indefinate
infinit/e    infinate
innocent/    inocent
itemiz/ed    itemize, itimized
kindergarten/    kindergarden
laborator/y    labratory
literal/ly    literaly, litterally
magnificent/    magnificant, magnigicient
material/ly    materialy
mathematic/s    mathamatics, mathmatics
melancholy/    melconcholy, meloncaly
mortgage/    morgage
myster/ious    misterious
myster/y    mistery, mystry
necessar/ily    necessarilly, necessarially
occasional/ly    occassionally, occasionly
opportunit/ies    opportunity, opportunities
ordinar/ily    ordinarilly
original/ly    origionally, orginally
pamphlet/    phamplet, pamplet, pamphalet
pamphlet/s    phamplets, pamplets
partial/    parcial, parcel
perceiv/e    percieve
phase/    faze, fase
physician/    physican, position, physcian
pos/sess    posess, posses
prefer/red    prefered
prejudic/e    predjudice, predudice
principle/s    principals
prior/    prier, pryor
privileg/e    priviledge, privelege
procedure/    proceedure
questionnaire/    questionaire
rating/    rateing, raiting
receipt/s    reciepts, receits, recipts
receive/r    reciever
reckon/    recon, reccon
recommend/    reccommend, recomend
recommend/ation    recomendation, reccommendation
recommend/ations    reccomendations, recomendations, reccommendation
recommend/ed    reccommended, recomended, recommend
recommend/ing    reccommending, recomending, reccomending
refer/red    refered
refer/ring    refering
regret/ted    regreted
rememb/rance    rememberance, rememberence
representati/ves    represenatives, representives, representitives
restaurant/    resturant, restaraunt
ridiculous/    rediculous
romantic/    romatic
satisfact/orily    satisfactorly, satisfactorally
scandal/    scandle, scandel
schedule/d    schedule
seize/d    siezed, ceased
solemn/    solomn, solmn
sororit/y    sororiety, sarrority
specific/ally    specificly, spacifically
specimen/    speciman
specimen/s    specimans
strenuous/    streneous, strenous
sufficient/ly    sufficently, suficiently
supplement/    suplement, supliment
temporar/y    temperary, tempory
thorough/    through, thourough
thorough/ly    throughly, thourghly, thouroughly
tonnage/    tonage
traged/y    tradegy, tradgedy
transfer/red    transfered
undoubted/ly    undoubtly, undoubtably
unfortunate/ly    unfortunatly, unfortionately
unusual/ly    unusual, unusally, usually, usualy
vacanc/y    vacency, vancancy
viol/ence    violets, violance
virtu/e    vertue, virture
visib/le    visable
voucher/    vulture, vouture

Nesbit Word Data

accept/ed    acceped, acceppted, acepted, aspect, eccepted, exccepted, excepted, exepted
ache/    aaek, ace, ack, acke, eak
actor/    acter, aktter
almost/    allmost, olmoest, olmost
already/    allready, alrday, olrede, olredy
although/    allthough, alltough, athough
alway/s    alaway, alway, alwes, awes
among/    amond, amoug, amoung, amuge, amung
angel/    angle
any/    enoy
anything/    anething, enething, enithing
apron/    aprine
arriv/e    arive
arriv/ed    arived, drive
assignment/    asingment, assiment, assinghment, assinment
author/    arthor, ather, athor, aurther, auther, authur, autor, awither
away/    uay
because/    becauce, becaus, becaus
beggar/    baggar, bagger, beager, beger, begger, bugger, pegger
bell/y    belley, bley
beneath/    beneith
blossom/s    blossems
bluff/    bulf
book/let    booklit
bought/    boght
brok/en    brokon
buil/d    biuld
burglar/    bargaler, berglar, bruglar, brugller, buglar, burgaler, burgeler, burgglar, burglor, burgular, burler
buy/    by
calculat/or    calclater
calv/es    calfs
capit/al    captial
captain/    captian, captin
cattle/    cattel
cedar/    ceader, ceddar, ceder, cetar, cittar, seader, seator, seatter, sedder, seder, seeder, seedor, setar
cent/ral    centeral
cent/re    center, centure, centum
certain/    centen, certan, certein, certin, sertain, surtain
cheap/    cheep
chemi/stry    cemestery
chemi/cal    camecal, cemacul, cemecal, cemical, cemicle
chickenpox/    chickenpocks
chose/n    chousen, cosen
christmas/    cristmas
Cinderella/    cinderela, cinderrela, sinderela, sindrelue
circ/le    cicle, circal, circel, curcal
combin/e    combind, combined, comfine, conbine
concert/    consert
consent/    concent, concint
contain/s    contane, contanes, contans

continu/e    continie, continu, contiu, contiune
cousin/s    cousons
crowd/    crowed
cruel/    crool, crual, crule
curtain/    certen, certian, curtane
daily/    dayly
dance/r    danncor
dazzl/e    dazzel
decid/e    decied, deside, dicide
deserv/e    desurve
destr/oy    distor
difficult/    diffacllt, diffuclt
does/    dase, dose, dous
doesnt/    dosent
donor/    doanor, donar, donear, doner, donner, donnor, dooner
dozen/    dosen
drawer/s    draws
drove/    drov
earl/y    eraly, erly
eighty/    eagity, eigty, etei
electr/ic    electrec, eletric
endless/    endlees
engage/    agage, engajde, engange, ingade, ingage, ingaged
erupt/ion    eraption, eropsion, eroption, erupshon
eskimo/s    eskamos
exercis/e    excercise, exersis, exersize, exrsise
fairy/    faiy, fary
favor/ite    faverit, favorit
field/    feild
fir/e    frie
fire/d    fride
fo/ur    for, fore
friend/    freind
fur/    fir
furni/ture    furnitchure
garden/    gardin
geolog/y    gedgily, geoligy, geologey, gologay
gingerbread/    gigrbred
grammar/    crammer, gramar, gramer, grammer, grammor, gramor
gretel/    gretl, grettel
half/    haff
hansel/    hancel, hansl, hasel
hav/ing    haveing
head/    haid, haid, hed
he/ar    here
herd/    heard, hered, hurd
hour/    huor
includ/e    enclude
instead/    insted
instrument/    insterment
interest/ed    intrested
junior/    juinor, junier
ladder/    latter
less/on    leson
liber/ty    libaty, libraty
limb/    lim
liqui/d    liqud, liqued
magic/    magce
major/    mager, magor, manger
man/y    manay, meny
maple/    mapel
massag/e    masach, masaga, masage, masoge, masosh, masuage, masuage, mesash, misage, musoshe, musouge
maybe/    manbe
mayor/    maiar, maire, major, mangor, marrior, marrir, marror
measle/s    mesels
militar/y    miletary, millytary
mirror/    mearor, mere, mieeor, mirrier, mirrior
model/    modle, motle
moist/    most
motor/    modor, motar, moter, motter, mottor
motor/ized    motorised
muddle/d    mudald, muddeld, mudduld, mudled, muttled
mutton/    meten, moten, muten, mutten
necessar/y    necasary, neccisary, nessacary, nessacery, nessasry, nesseary, nessecery, nesseray
necklace/    neclace
nerv/ous    nerves
nickel/    nickle
ninety/    nighty, nintey, nity
north/ern    northeren
ocean/    oacen, otion
omit/    omeit
operator/    operater, opereator, oporator, oporeator, opperator, opporator, opreator

pajama/ s 

pe jamas 

pas/ s 

parss 

pas/sed 

paste 

pass/age 

pasag 

pasage 

pasige 

pas/te 

paset 

past 

percent/ 

per  sent 

perfume/ 

per f oum 
per  f um 

pervum 
p i r  f  um 
pur f um 
pur  fume 

period/ 

par iod 
peareaid 
pear  id 
per  id 
per ied 
per ieod 
per i ode 
perioid 
pi  red 

phon/e 

phown 

piece/ 

pe  ice 

pimples/ 

pimpls 

pimppal 

pimppleas 

pimpules 

planet/ 

pannet 
plani  t 

pleas/e 

plaes 

poe/m 

poum 

poe/ 1 

poety 

i 


. 
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poi  t 
pouet 
powi  t 

polar/ 

pal  or 

poler 

poller 

polor 

polor 

possib/le 

posible 

prince/ 

pries 

princes/s 

pr ines 

pr iz/e 

prise 

probab/ly 

probibly 

purpos/e 

perpus 

puzzl/e 

puzzel 

qua/ke 

qauk 

quacke 

qui/te 

quit 

rac/ing 

race ing 

raze/ 

raies 

rais 

raise 

rase 

rays 

read/y 

rede 

redte 

redy 

real/ly 

realy 

regain/ 

regan 

regane 

rifle/ 

rifal 
r  i  f  el 
r i f iel 

right/ 

write 

rough/ 

rof  e 
rof  f 

. 
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ruff 

rug 

ruf/f 

rof  f  e 
rouf 
rouf  e 
rough 
ruf 
ruf  e 

sailor/ 

sa i lar 
sailer 
sailor 
salor 

scen/e 

sene 

sentenc/e 

sentance 
sent ince 

sett 1/ed 

seteld 

seteled 

set led 

setteld 

setteled 

settld 

settles 

several/ 

sevaral 

severel 

sevral 

sevrall 

shin/y 

shiney 

shovel/ 

shovle 

shr/ank 

shrak 

shranke 

simpl/e 

s i mmpa 1 
simpaill 

size/ 

sies 

si ic/e 

slise 

soccer/ 

socer 

sour/ 

sawr 

sror 

squ/are 

saer 

‘ 
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squirrel/ 

squi r le 

steer/ 

stear 
st  ire 

stoop/ 

stup 

stor/y 

stor ie 

st r/i ke 

srake 
str  iek 

success/ 

succees 

succes 

succese 

sucess 

sucsses 

susses 

success iv/e 

secseveve 
secs i ve 
sicsec iv 
sicsesof 
sicsesove 
succes ive 
succesof 
successuf 
succesuve 
succeuf e 
suckseof 
sue  seccof 
sucsesive 
sucsesof 
sucsesof e 
sucsesseve 
sucsessi ve 
sucssevive 
sugees i ve 
suges i ve 
sussisive 

sure/ 

sur 

surviv/ing 

servi ving 
surf i ving 
surrving 
surving 
survi veing 
suvi ving 

ta/le 

tai  le 

televi s/ion 

tel i vi s ion 
tellevision 

“ 


■ 


■ 
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term/s 

turns 

throat/ 

throte 

through/ 

thour 

throgh 

throu 

thumb/ 

thoum 

too/ 

to 

toward/ s 

towords 

towel/ 

taule 

towle 

troll/ 

trole

tunnel/ 

tunel 

turtle/ 

turtel 

val/ley 

vailiy

view/

veiw
vieu

visitor/

visator
visiter
visittor
vister
vistor

weigh/

wiegh

which/

whitch
wich

whit/e

whit
whitte
wiht
wite

who/se 

whoes 

wise/ 

wase 

wies 

with/ 

whith

wors/t

worsest

wrist/

rist

writ/ing

righting
wirting
wrighting
writting
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Appendix  B.  Words  Deleted  From  Dictionary 


afire 
afield
ahead
air
airplanes
airy 

anesthetics 

anne 

annie 

awe 

chang 

dale 

daley 

daly 

delay 

dose 

doze 

drive 

ehre 

er 

erie

fer 

fez 

fiery 

fife 

georges 


message 

modal 

modeled 

our 

paced 

paz 

peace 

peas 

radii 

raise

raising 

rally 

ray 

relay 

rely 

Salem 

seen 

seine 

sen 

shanty 

shines 

shore 

sighs 

sin 

skeptical 

stir 


Appendix C. Pascal Source Code for the Algorithms Tested in Experiment I


const  length = 80;

type   chars = 0..length;
       string = record
                  line : array [chars] of char;
                  len  : chars
                end;
       stringlist = array [1..length] of string;

function stest (arg1,arg2 : string) : boolean;
var i : integer;
begin
  stest := true;
  if arg1.len <> arg2.len
    then stest := false
    else for i := 1 to arg1.len do
           if arg1.line[i] <> arg2.line[i]
             then stest := false;
end;

procedure delete (var str : string; pos : integer);
var i : integer;
begin
  for i := pos to str.len do str.line[i] := str.line[i+1];
  str.len := str.len-1;
end;

procedure truncate (var str : string; pos : integer);
var i : integer;
begin
  if str.len>pos then
    begin
      str.len := pos;
      for i := pos downto 1 do
        if str.line[i] = ' ' then str.len := i-1;
    end;
end;

procedure delchar
  (var str : string; c : char; pos : integer);
var i : integer;
begin
  for i := str.len downto pos do
    if str.line[i]=c then delete(str,i);
end;

procedure deladj (var str : string);
var i : integer;
begin
  for i := str.len downto 2 do
    if str.line[i] = str.line[i-1] then delete(str,i);
end;

procedure BLAIR (a,b : string; var r : integer);
var tlen : integer;

  procedure code (var str : string);
  type vector = array [1..length] of integer;
       string9 = packed array[1..9] of char;
  var i,j,k,diff,maxpos : integer;
      charscore : vector;
      wstr1,wstr2 : string9;
  begin
    for i := 1 to str.len do
case  str.line[i]  of 


'  d '  , 

q 

'  *  Y  '  • 
r  A 

charscore [ i ] 

:=  0; 

'  b '  , 

'f  ' 

k 

' ,'m' , 

'  v'  , 

’  w'  ,  ' 

z 

'  . 

• 

charscore [ i ] 

=  1; 

? g' , 

'  y '  : 

charscore [ i ] 

=  2; 

'n'  , 

'  p'  /  ' 

t 

«  . 

• 

charscore [ i ] 

=  3; 

'  o'  , 

T  r  !  ! 

L  r 

u 

*  . 

• 

charscore [ i ] 

=  4; 

’a'  , 

fc'  ,' 

h 

f  *  I  T 

/  X  f 

's': 

charscore [ i ] 

=  5; 

        'i' : charscore[i] := 6;
        'e' : charscore[i] := 7;
        otherwise write('BAD STRING');
      end;

    wstr1 := '024556667';
    wstr2 := '134556677';
    i := 1;
    j := str.len;
    while i<=j do
      begin
        if i<9 then k := i else k := 9;
        charscore[i] := charscore[i] + (ord(wstr1[k])-48);
        if i<j then
          charscore[j] := charscore[j] + (ord(wstr2[k])-48);
        i := i+1;
        j := j-1;
      end;
    diff := str.len - tlen;
    for j := 1 to diff do
      begin
        maxpos := 1;
        for i := 1 to str.len do
          if charscore[i]>=charscore[maxpos]
            then maxpos := i;
        delete(str,maxpos);
        for i := maxpos to str.len do
          charscore[i] := charscore[i+1];
      end;
  end;

begin
  tlen := 4;
  code(a);
  code(b);
  if stest(a,b) then r := 1 else r := 0;
end;

procedure DAMERAU (a,b : string; var r : integer);
var m,n,diff,errorcount,firsterror,lasterror : integer;

  procedure match (length : integer);
  var i : integer;
  begin
    errorcount := 0;
    firsterror := 0;
    lasterror := 0;
    for i := 1 to length do
      begin
        if a.line[i] <> b.line[i] then
          begin
            errorcount := errorcount + 1;
            if errorcount=1
              then firsterror := i
              else lasterror := i;
          end;
      end;
  end;

begin
  m := a.len;
  n := b.len;
  diff := m-n;
  if (diff < -1) or (diff > 1)
    then r := 0
    else case diff of
       0: begin
            match(m);
            if errorcount < 3
              then
                case errorcount of
                  0,1: r := 1;
                  2: if (firsterror = lasterror-1)
                        and
                        (a.line[firsterror] = b.line[lasterror])
                        and
                        (b.line[firsterror] = a.line[lasterror])
                       then r := 1
                       else r := 0;
                end
              else r := 0;
          end;
      -1: begin
            match(m);
            if errorcount = 0
              then r := 1
              else
                begin
                  delete(b,firsterror);
                  match(m);
                  if errorcount = 0
                    then r := 1
                    else r := 0;
                end;
          end;
      +1: begin
            match(n);
            if errorcount = 0
              then r := 1
              else
                begin
                  delete(a,firsterror);
                  match(n);
                  if errorcount = 0
                    then r := 1
                    else r := 0;
                end;
          end;
    end;
end;
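
As a reading aid for the DAMERAU procedure above, the following is a minimal Python sketch of the same single-error test. It is not part of the original listing; the function name damerau_match is hypothetical, and ordinary Python strings stand in for the fixed-length string record. Two strings match if they are identical or differ by one substitution, one transposition of adjacent characters, or one inserted or deleted character.

```python
def damerau_match(a: str, b: str) -> bool:
    """Single-error test modeled on the DAMERAU procedure (hypothetical name)."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Positions at which the two strings disagree.
        errors = [i for i in range(len(a)) if a[i] != b[i]]
        if len(errors) <= 1:
            return True  # identical, or one substitution
        if len(errors) == 2:
            i, j = errors
            # Two adjacent mismatches that are a swap of each other.
            return j == i + 1 and a[i] == b[j] and a[j] == b[i]
        return False
    # Lengths differ by one: drop the first disagreeing character of the
    # longer string and require the remainders to agree.
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    for i in range(len(short)):
        if short[i] != long_[i]:
            return short[i:] == long_[i + 1:]
    return True  # the extra character is at the end of the longer string


print(damerau_match("weigh", "wiegh"))  # adjacent transposition
print(damerau_match("right", "write"))  # more than one error
```

The sketch returns a boolean where the Pascal procedure sets r to 1 or 0.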

procedure SYMONDS (a,b : string; var r : integer);
var i,j,asize,bsize : integer;
    alist,blist : stringlist;

  procedure parse
    (str : string; var codelist : stringlist;
     var size : integer);
  var h,i,j,k,newsize : integer;

    procedure nplicate(n : integer);
    var p : integer;
    begin
      newsize := n * size;
      for p := 1 to n-1 do
        for k := 1 to size do
          begin
            for h := 1 to codelist[k].len do
              codelist[k + size*p].line[h] :=
                codelist[k].line[h];
            codelist[k + size*p].len := codelist[k].len;
          end;
      k := 1;
    end;

    procedure catlist(c : char);
    begin
      j := k;
      while j < k+size do
        begin
          codelist[j].len := codelist[j].len + 1;
          codelist[j].line[codelist[j].len] := c;
          j := j + 1;
        end;
      k := j;
      if k>newsize then
        begin
          size := newsize;
          k := 1;
        end;
    end;

  begin
    i := 1;
    k := 1;
    size := 1;
    newsize := 1;
    codelist[1].len := 0;
    str.line[str.len+1] := ' ';
    while i <= str.len do
      if (i>1) and (str.line[i]=str.line[i-1])
        then i := i + 1
        else

          case str.line[i] of
            'b': begin
                   catlist('B');
                   i := i+1;
                 end;
            'c':
              case str.line[i+1] of
                'h': begin
                       nplicate(2);
                       catlist('s');
                       catlist('C');
                       i := i+2;
                     end;
                'k': begin
                       catlist('K');
                       i := i+2;
                     end;
                'q': begin
                       catlist('Q');
                       i := i+2;
                     end
                otherwise
                     begin
                       nplicate(2);
                       catlist('S');
                       catlist('K');
                       i := i+1;
                     end
              end;
            'd':
              case str.line[i+1] of
                'g': begin
                       nplicate(2);
                       catlist('J');
                       catlist('G');
                       i := i+2;
                     end
                otherwise
                     begin
                       catlist('D');
                       i := i+1;
                     end
              end;
            'f': begin
                   catlist('F');
                   i := i+1;
                 end;
            'g':
              case str.line[i+1] of
                'h': if str.line[i+2] = 't'
                       then
                         begin
                           nplicate(2);
                           catlist('F');
                           k := 1;
                           catlist('T');
                           catlist('T');
                           i := i+3;
                         end
                       else
                         begin
                           nplicate(2);
                           catlist('F');
                           catlist('G');
                           i := i+2;
                         end;
                'n': begin
                       catlist('N');
                       i := i+2;
                     end
                otherwise
                     begin
                       nplicate(2);
                       catlist('J');
                       catlist('G');
                       i := i+1;
                     end
              end;
            'h': begin
                   catlist('H');
                   i := i+1;
                 end;
            'j': begin
                   catlist('J');
                   i := i+1;
                 end;
            'k': if str.line[i+1]='n'
                   then
                     begin
                       catlist('N');
                       i := i+2;
                     end
                   else
                     begin
                       catlist('K');
                       i := i+1;
                     end;
            'l': begin
                   catlist('L');
                   i := i+1;
                 end;
            'm': begin
                   catlist('M');
                   i := i+1;
                 end;
            'n': begin
                   catlist('N');
                   i := i+1;
                 end;
            'p': if str.line[i+1] = 'h'
                   then
                     begin
                       catlist('F');
                       i := i+2;
                     end
                   else
                     begin
                       catlist('P');
                       i := i+1;
                     end;
            'q': begin
                   catlist('Q');
                   i := i+1;
                 end;
            'r': if str.line[i+1] = 'h'
                   then
                     begin
                       catlist('R');
                       i := i+2;
                     end
                   else
                     begin
                       catlist('R');
                       i := i+1;
                     end;
            's':
              case str.line[i+1] of
                'c': if (str.line[i+2]='i') or
                        (str.line[i+2]='e')
                       then
                         begin
                           nplicate(2);
                           catlist('S');
                           catlist('s');
                           i := i+3;
                         end
                       else
                         begin
                           nplicate(2);
                           catlist('S');
                           catlist('Z');
                           i := i+1;
                         end;
                'h': begin
                       catlist('s');
                       i := i+2;
                     end
                otherwise
                     begin
                       nplicate(2);
                       catlist('S');
                       catlist('Z');
                       i := i+1;
                     end
              end;
            't':
              case str.line[i+1] of
                'c': if str.line[i+2] = 'h'
                       then
                         begin
                           nplicate(3);
                           catlist('s');
                           catlist('C');
                           catlist('K');
                           i := i+3;
                         end
                       else
                         begin
                           catlist('T');
                           i := i+1;
                         end;
                'i': if str.line[i+2]='o'
                       then
                         begin
                           nplicate(2);
                           catlist('C');
                           catlist('s');
                           i := i+3;
                         end
                       else
                         begin
                           catlist('T');
                           i := i+1;
                         end;
                'h': begin
                       catlist('t');
                       i := i+2;
                     end
                otherwise
                     begin
                       catlist('T');
                       i := i+1;
                     end
              end;
            'w': if i<>str.len
                   then
                     if str.line[i+1]='h'
                       then
                         begin
                           catlist('W');
                           i := i+2;
                         end
                       else
                         begin
                           catlist('W');
                           i := i+1;
                         end
                   else i := i+1;
            'x': begin
                   nplicate(2);
                   catlist('K');
                   k := 1;
                   catlist('S');
                   catlist('G');
                   size := size div 2;
                   k := size+1;
                   catlist('Z');
                   i := i+1;
                 end;
            'v': begin
                   catlist('V');
                   i := i+1;
                 end;
            'y': if i=1
                   then
                     begin
                       catlist('Y');
                       i := i+1;
                     end
                   else i := i+1;
            'z': begin
                   catlist('Z');
                   i := i+1;
                 end
            otherwise i := i+1
          end;
  end;

begin
  parse(a,alist,asize);
  parse(b,blist,bsize);
  i := 0;
  r := 0;
  while (r=0) and (i<asize) do
    begin
      i := i+1;
      j := 0;
      while (r=0) and (j<bsize) do
        begin
          j := j+1;
          if stest(alist[i],blist[j]) then r := 1;
        end;
    end;
end;

procedure DAMERAU_SYMONDS (a,b : string; var r : integer);
begin
  DAMERAU(a,b,r);
  if r=0 then SYMONDS(a,b,r);
end;

procedure SOUNDEX (a,b : string; var r : integer);

  procedure code (var str : string);
  var i : integer;
  begin
    for i := 2 to str.len do
      case str.line[i] of
        'a','e','i','o',
        'u','h','w','y' : str.line[i] := 'A';
        'b','f','p','v' : str.line[i] := 'B';
        'c','g','j','k',
        'q','s','x','z' : str.line[i] := 'C';
        'd','t'         : str.line[i] := 'D';
        'l'             : str.line[i] := 'L';
        'm','n'         : str.line[i] := 'M';
        'r'             : str.line[i] := 'R';
        otherwise writeln('BAD STRING ');
      end;
    delchar(str,'A',2);
    deladj(str);
    truncate(str,4);
  end;

begin
  code(a);
  code(b);
  if stest(a,b) then r := 1 else r := 0;
end;
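
The coding steps of the SOUNDEX procedure above can be sketched in Python as follows. This is an illustrative rendering and not part of the original listing; the names CLASSES, CODE, soundex_code and soundex_match are hypothetical, and the input is assumed to be a non-empty string of letters. As in the Pascal, the first letter is kept as typed, the remaining letters are replaced by class codes, the vowel-class code 'A' is removed, adjacent duplicates are collapsed, and the result is cut to four characters. Unlike some other Soundex variants, no padding is added.

```python
# Letter classes taken from the case statement in the SOUNDEX procedure.
CLASSES = {
    "A": "aeiouhwy",
    "B": "bfpv",
    "C": "cgjkqsxz",
    "D": "dt",
    "L": "l",
    "M": "mn",
    "R": "r",
}
CODE = {ch: code for code, letters in CLASSES.items() for ch in letters}


def soundex_code(word: str) -> str:
    """Code a word the way the listing does (hypothetical helper name)."""
    word = word.lower()
    # Letters after the first are replaced by their class code; a character
    # outside a-z raises KeyError here, where the Pascal prints BAD STRING.
    coded = [CODE[ch] for ch in word[1:]]
    # delchar(str,'A',2): drop every 'A' from position 2 onward.
    coded = [c for c in coded if c != "A"]
    # deladj(str): collapse runs of identical adjacent characters.
    out = [word[0]]
    for c in coded:
        if c != out[-1]:
            out.append(c)
    # truncate(str,4): keep at most four characters.
    return "".join(out)[:4]


def soundex_match(a: str, b: str) -> bool:
    return soundex_code(a) == soundex_code(b)


print(soundex_code("sailor"), soundex_code("salor"))
print(soundex_match("right", "write"))
```

Because the first letter is compared uncoded, pairs such as right and write cannot match under this scheme.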

procedure PLANIT (a,b : string; var r : integer);

  procedure code (var str : string);
  var i : integer;
  begin
    for i := 1 to str.len do
      case str.line[i] of
        'a','e','i','o',
        'u','y'         : str.line[i] := 'A';
        'b','f','p','v' : str.line[i] := 'B';
        'c','g','j','k',
        'q','s','x','z' : str.line[i] := 'C';
        'd','t'         : str.line[i] := 'D';
        'h','w'         : str.line[i] := 'H';
        'l'             : str.line[i] := 'L';
        'm','n'         : str.line[i] := 'M';
        'r'             : str.line[i] := 'R';
        otherwise write('BAD STRING ');
      end;
    delchar(str,'H',2);
    deladj(str);
    delchar(str,'A',1);
  end;

begin
  code(a);
  code(b);
  if stest(a,b) then r := 1 else r := 0;
end;


procedure ROOF_SBYC_STRING (a,b : string; var r : integer);
var m,n,i,j,k : integer;
    rtemp : real;
    coin : array [1..length,1..length] of real;

  function max (arg1,arg2 : integer): integer;
  begin
    if arg1 > arg2 then max := arg1 else max := arg2;
  end;

  procedure roof;
  begin
    if (m=1) or (n=1)
      then
        for i := 1 to m do
          for j := 1 to n do
            coin[i,j] :=
              coin[i,j] * (1-abs(i/m-j/n))
      else
        for i := 1 to m do
          for j := 1 to n do
            coin[i,j] :=
              coin[i,j] * (1-abs((i-1)/(m-1)-(j-1)/(n-1)))
  end;

  procedure sbyc;
  var sel,baki : integer;
      flag : boolean;
  begin
    for i := 1 to m do
      begin
        sel := 1;
        for j := 1 to n do
          if coin[i,j] > coin[i,sel]
            then
              begin
                flag := false;
                for baki := i-1 downto 1 do
                  if coin[baki,j]<>0
                    then flag := true;
                if flag=false then sel := j;
              end;
        for j := 1 to n do
          if j<>sel then coin[i,j] := 0;
      end;
  end;

  procedure stri;
  begin
    rtemp := 0;
    for i := 1 to m do
      for j := 1 to n do
        if coin[i,j]<>0
          then
            begin
              k := 0;
              while (i+k<=m)
                and (j+k<=n)
                and (coin[i+k,j+k]<>0) do k := k+1;
              rtemp := rtemp + coin[i,j] * k;
            end;
    k := max(m,n);
    k := (k**2+k) div 2;  (* normalizing divisor *)
    r := round(100*(rtemp/k));
  end;

begin
  m := a.len;
  n := b.len;
  for i := 1 to m do
    for j := 1 to n do
      if a.line[i]=b.line[j]
        then coin[i,j] := 1
        else coin[i,j] := 0;
  roof;
  sbyc;
  stri;
end;

procedure WAGNER (a,b : string; var r : integer);
var i,j,m,n,min,diff : integer;
    dist,idcost,sml,sub,ins,del : real;
    d : array [-1..20,-1..20] of real;

  function subcost : real;
  begin
    if a.line[i] = b.line[j]
      then subcost := 0.0
      else subcost := 0.7;
  end;

begin
  idcost := 0.5;
  r := 0;
  m := a.len;
  n := b.len;
  if m < n then min := m else min := n;
  diff := abs(m-n);
  d[0,0] := 0;
  for i := 1 to m do d[i,0] := d[i-1,0]+idcost;
  for j := 1 to n do d[0,j] := d[0,j-1]+idcost;
  for i := 1 to m do
    for j := 1 to n do
      begin
        sub := d[i-1,j-1] + subcost;
        ins := d[i-1,j] + idcost;
        if ins<sub then sml := ins else sml := sub;
        del := d[i,j-1] + idcost;
        if del<sml then sml := del;
        d[i,j] := sml;
      end;
  (* The worst case edit distance is:

         min(m,n) * subcost
         +
         diff(m,n) * idcost

     which can become a normalizing factor to get a metric
     between 0 and 100.
  *)
  dist := d[m,n];
  r := round(100*(1-dist/(min*subcost+diff*idcost)));
end;
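
The normalization described in the comment of the WAGNER procedure above can be illustrated with a short Python sketch. It is not part of the original listing; the function name wagner_score is hypothetical, the cost defaults (0.5 for insertion or deletion, 0.7 for substitution) are the constants that appear in the Pascal, and at least one of the two strings is assumed to be non-empty so that the normalizing divisor is not zero.

```python
import math


def wagner_score(a: str, b: str, idcost: float = 0.5, subcost: float = 0.7) -> int:
    """Weighted edit distance scaled to 0..100 (hypothetical helper name)."""
    m, n = len(a), len(b)
    # d[i][j] = cheapest cost of editing a[:i] into b[:j].
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + idcost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + idcost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (0.0 if a[i - 1] == b[j - 1] else subcost)
            d[i][j] = min(sub, d[i - 1][j] + idcost, d[i][j - 1] + idcost)
    # Worst case: substitute min(m,n) characters, insert or delete the rest.
    worst = min(m, n) * subcost + abs(m - n) * idcost
    # floor(x + 0.5) rounds halves upward for these non-negative values.
    return math.floor(100 * (1 - d[m][n] / worst) + 0.5)


print(wagner_score("ready", "redy"))   # one deletion
print(wagner_score("weigh", "wiegh"))  # transposition costs a delete plus an insert
```

With these costs a transposed pair of letters is charged as one deletion plus one insertion (1.0), which is cheaper than two substitutions (1.4).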

procedure HALL (a,b : string; var r : integer);
var i,j,m,n,min,diff : integer;
    inf,dist,idcost,tracost,sml,sub,ins,del,tra : real;
    d : array [-1..20,-1..20] of real;

  function subcost(i,j : integer) : real;
  begin
    if a.line[i] = b.line[j]
      then subcost := 0.0
      else subcost := 0.7;
  end;

begin
  idcost := 0.5;
  tracost := 0.2;
  r := 0;
  m := a.len;
  n := b.len;
  if m < n then min := m else min := n;
  diff := abs(m-n);
  inf := m * idcost + n * idcost;
  d[0,0] := 0;
  d[-1,-1] := 0;
  d[0,-1] := inf;
  d[-1,0] := inf;
  for i := 1 to m do
    begin
      d[i,0] := d[i-1,0] + idcost;
      d[i,-1] := inf;
    end;
  for j := 1 to n do
    begin
      d[0,j] := d[0,j-1] + idcost;
      d[-1,j] := inf;
    end;
  for i := 1 to m do
    for j := 1 to n do
      begin
        sub := d[i-1,j-1] + subcost(i,j);
        ins := d[i-1,j] + idcost;
        if ins<sub then sml := ins else sml := sub;
        del := d[i,j-1] + idcost;
        if del<sml then sml := del;
        if (j>1) and (i>1)
          then
            begin
              tra := d[i-2,j-2] + subcost(i-1,j) +
                     subcost(i,j-1) + tracost;
              if tra<sml then sml := tra;
            end;
        d[i,j] := sml;
      end;
  dist := d[m,n];
  r := round(100 * (1 - dist/(min * 0.7 + diff * idcost)));
end;
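
The HALL procedure above extends the same recurrence with a transposition term. The Python sketch below is an illustrative rendering, not part of the original listing; the name hall_score is hypothetical, the cost defaults are the constants from the Pascal (0.5, 0.7, and 0.2 for a transposition), and at least one string is assumed to be non-empty. The sentinel row and column at index -1 in the Pascal array are not needed here because the transposition term is only evaluated when both i and j exceed 1.

```python
import math


def hall_score(a: str, b: str, idcost: float = 0.5,
               subcost: float = 0.7, tracost: float = 0.2) -> int:
    """Edit distance with adjacent transposition, scaled to 0..100."""
    m, n = len(a), len(b)

    def sub(i: int, j: int) -> float:
        # 1-based indices, as in the Pascal subcost(i,j).
        return 0.0 if a[i - 1] == b[j - 1] else subcost

    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + idcost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + idcost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(d[i - 1][j - 1] + sub(i, j),
                       d[i - 1][j] + idcost,
                       d[i][j - 1] + idcost)
            if i > 1 and j > 1:
                # Swap of the last two characters, plus any residual
                # substitution cost if the swapped characters still differ.
                tra = d[i - 2][j - 2] + sub(i - 1, j) + sub(i, j - 1) + tracost
                best = min(best, tra)
            d[i][j] = best
    worst = min(m, n) * subcost + abs(m - n) * idcost
    return math.floor(100 * (1 - d[m][n] / worst) + 0.5)


print(hall_score("weigh", "wiegh"))  # one transposition
```

For a pair that differs only by one swap of adjacent letters, the distance is the transposition cost alone, so the score stays close to 100.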




Appendix  D.  Results  of  Experiments  I  and  III 

The  following  table  shows  type  I  and  type  II  error 
values  recorded  for  the  algorithms  tested  in  experiments  I 
and  III  with  all  four  word  data  lists.  It  is  structured  as  a 
matrix  with  word  data  lists  comprising  the  horizontal  axis 
and  algorithms  comprising  the  vertical  axis.  Each 
experimental  event  is  represented  in  the  table  by  five 
values  in  the  following  manner: 

(frequency  type  I  error)  (frequency  type  II  error) 
(proportion  type  I  error)  (mean  frequency  type  II  error) 

(proportion  type  II  error) 

The  proportion  type  I  error  and  mean  type  II  error  frequency 
were  calculated  as  described  on  page  68  and  were  graphed  in 
figures  15  through  18. 

Recalling  that  the  non-binary  algorithms  were 
programmed  to  return  an  integer  between  0  and  100  inclusive, 
note  that  each  threshold  set  on  that  returned  value  is 
represented  here  as  an  independent  algorithm.  Threshold 
settings  are  indicated  in  the  leftmost  column  underneath  the 
name  of  the  algorithm  to  which  they  apply.  Comparisons 
returning  a  value  greater  than  or  equal  to  the  threshold 
were  defined  as  resulting  in  a  match.  To  save  space,  the 
data  reported  here  range  from  the  highest  threshold  setting 
at  which  no  type  I  errors  were  found  to  the  lowest  threshold
at  which  no  type  II  errors  were  found. 

To  take  an  example,  with  a  threshold  setting  of  75,  HN7 
produced  the  following  results  when  tested  with  the  Nesbit 
word  data: 


        84          1483
        0.160       6.962
                    3.21605E-04

This  means  that  84  word-misspelling  pairs  failed  to  yield  a 
value  greater  than  or  equal  to  75,  and  that  these  84 
constituted  16%  of  such  pairs  in  the  Nesbit  word  data.  Of 
all  comparisons  between  words  from  the  Nesbit  word  data  and 
words  from  the  dictionary,  1483  failed  to  yield  a  value  less 
than  75.  These  constituted  about  .03%  of  all  such 
comparisons  and  represent  a  mean  of  about  7  matches  in  the 
dictionary  for  every  word  in  the  word  data  list. 
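
The arithmetic behind the three derived values can be sketched in Python as follows. This is only an illustration: the function name error_summary is hypothetical, the exact procedure is the one described on page 68, and the totals used in the example (525 word-misspelling pairs, 213 words, a dictionary of 21649 words) are not stated in this appendix. They are assumptions chosen so that the output reproduces the figures quoted above.

```python
def error_summary(type1_count, type2_count, n_pairs, n_words, n_dictionary):
    """Derive the proportions and mean reported for one experimental event.

    n_pairs      -- number of word-misspelling pairs in the word data list
    n_words      -- number of correctly spelled words in the word data list
    n_dictionary -- number of dictionary words each word is compared against
    """
    comparisons = n_words * n_dictionary
    return {
        "proportion_type1": type1_count / n_pairs,
        "mean_type2": type2_count / n_words,
        "proportion_type2": type2_count / comparisons,
    }


def is_match(score, threshold):
    # A comparison counts as a match when the returned value is greater
    # than or equal to the threshold.
    return score >= threshold


# Assumed totals (see the note above); the counts 84 and 1483 are from the text.
summary = error_summary(84, 1483, n_pairs=525, n_words=213, n_dictionary=21649)
for name, value in summary.items():
    print(name, value)
```

Under these assumed totals the three values agree with 0.160, 6.962 and 3.21605E-04 to the precision shown in the example.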
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Appendix  E.  Examples  of  Errors  by  HN7 

The  following  are  instances  of  type  I  and  type  II 
error  committed  by  HN7  when  tested  with  the  Nesbit  word 
data.  Correct  spellings  appear  in  the  left  column,  the 
corresponding  misspellings  appear  in  the  middle  column, 
and  the  value  returned  by  HN7  for  each  word-misspelling 
pair  appears  in  the  right  column.  All  type  I  errors  are
given  here,  but  only  a  small,  randomly  selected  portion
of  the  type  II  errors  is  presented.

Type  I  Error 


accept/ed       aspect        56
ache/           aaek          48
                ack           66
                eak           31
actor/          aktter        67
already/        olrede        64
alway/s         awes          66
among/          amuge         68
any/            enoy          62
arriv/ed        drive         58
author/         awither       74
away/           uay           59
beggar/         pegger        71
cedar/          cittar        65
                seader        72
                seator        56
                seatter       52
                sedder        63
                seder         71
                seeder        67
                seedor        67
                setar         66
certain/        centen        70
Cinderella/     sindrelue     70
eighty/         etei          44
engage/         agage         72
                ingade        69
geolog/y        gedgily       58
grammar/        crammer       73
ladder/         latter        73
major/          mager         71
                manger        63
massag/e        masach        68
                masosh        68
                mesash        68
                musoshe       67
                musouge       71
mayor/          maiar         71
                maire         68
                mangor        72
                marrior       67
                marrir        63
                marror        72
mirror/         mere          56
                mieeor        69
mutton/         meten         70
                moten         70
ninety/         nighty        69
ocean/          otion         61
pas/sed         paste         72
pimples/        pimppal       69
prince/         pries         74
raze/           raies         67
                rais          58
                raise         69
                rays          55
read/y          rede          69
                redte         63
right/          write         42
rough/          rofe          53
                roff          50
                ruff          50
ruf/f           rouf          71
                roufe         69
                rough         50
simpl/e         simpaill      73
size/           sies          71
sour/           sawr          65
                sror          71
squ/are         saer          73
successiv/e     secseveve     69
                sicseciv      74
                sicsesof      65
                succesof      74
                succeufe      71
                suckseof      64
                sucseccof     65
                sucsesof      71
term/s          turns         71
through/        thour         67
towel/          taule         66
wrist/          rist          72
writ/ing        righting      67
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Type  II  Error 


ache/           ace           86
                achmet        76
                acre          77
                apache        78
                arched        76
                archer        76
                arches        76
                ash           76
                ashes         75
                cache         81
anything/       nothing       78
bought/         blight        76
                bough         88
                boughs        84
                bright        76
                brought       90
                fought        80
                nought        78
                ought         77
                sought        78
cattle/         battle        78
                cale          76
                castle        84
                castles       76
                rattle        78
                Seattle       80
                settle        76
chemi/stry      cemetery      75
geolog/y        ecology       78
                gloomy        76
                theology      75
instrument/     installment   77
                instant       75
                investment    77
measle/s        males         83
                maples        80
                mass          76
                masses        80
                meals         83
                measures      81
                medals        76
                mesas         79
                messages      75
                metals        76
                miles         75
                miracles      77
                moles         75
                moses         75
                moslems       77
                mules         75
                muscles       77
                mussels       75
                weasels       77
possib/le       impossible    80
                passable      86
                permissible   77
puzzl/e         muzzle        78
rough/          rug           76
                trough        77
slic/e          silence       79
                silica        81
                since         76
                slide         82
                slime         82
                solace        81
                spice         82
writ/ing        awaiting      75
                rewriting     79
                waiting       87
                wanting       75
                warming       75
                warning       75
                warring       75
                wasting       75
                wetting       75
                whirling      77
                whirring      77
                wiping        76
                wiring        86
                wising        76
                wording       77
                working       75
                wrestling     77
                wring         81
                wringing      79
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